This research has investigated the feasibility of using a distance measure, called the Bayesian distance, for automatic sequential document classification. It has been shown that by observing the variation of this distance measure as keywords are extracted sequentially from a document, the occurrence of noisy keywords may be detected. This measure has been utilised to design a sequential classification algorithm which works in two phases. In the first phase keywords extracted from a document are partitioned into two groups, the good keyword group and the noisy keyword group. In the second phase these two groups of keywords are analyzed separately to assign primary and secondary classes to a document. The algorithm has been applied to the SPIN data base and very encouraging results have been obtained.
CHAPTER I

INTRODUCTION

The purpose of document classification is to aid the process of information retrieval from an information system. Such a system may contain a collection of written texts, books, summaries, abstracts, titles and so on, each of which is considered to be a document. A user of this system typically formulates a request in natural language describing the subject area or areas in which he seeks information. After such a request is formulated, the document collection is searched and all items considered to be relevant to the user's needs are retrieved.

The task of identifying documents which are relevant to a given request is a complex one and is usually done by comparing the contents of a document with that of the search request. This means that each document in the collection has to be analysed for its content and represented by means of a set of content identifiers. These content identifiers, also referred to as attributes, are usually a set of distinguishing words, phrases or sentences that together describe the complete area of discourse of the entire document collection. The process of analyzing a document and representing it in terms of one or more of the above types of attributes is known as content analysis. It assumes three basic forms: abstracting, indexing and classification. Abstracting procedures generally use sentences as content identifiers, while indexing and classification techniques use phrases and words.

As opposed to the abstracting operations, where the abstracts produced are stored sequentially by item identifier number, indexing and classification generally presume an additional operation of file inversion. The index terms or classification codes are first ordered in some fashion, and corresponding to each index term or classification code a list of document identifiers is maintained. This is called an inverted file. As a result, instead of searching through an entire document collection, only the appropriate attribute lists in the inverted file need be examined in order to extract documents relevant to a given query. Thus, basically, both indexing and classification partition a data base into groups. There is, however, a major difference between the two.
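The file inversion described above can be sketched in a few lines. This is only an illustration, not part of the original work: the function names `build_inverted_file` and `search`, and the three sample documents, are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each index term to the sorted list of identifiers of documents containing it."""
    inverted = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            inverted[term].add(doc_id)
    return {term: sorted(ids) for term, ids in inverted.items()}


def search(inverted, query_terms):
    """Return documents containing every query term, examining only the relevant lists."""
    lists = [set(inverted.get(term, ())) for term in query_terms]
    if not lists:
        return []
    return sorted(set.intersection(*lists))


# Hypothetical three-document collection, keyed by item identifier number.
documents = {
    1: ["retrieval", "index"],
    2: ["classification", "keyword"],
    3: ["retrieval", "keyword"],
}
inverted_file = build_inverted_file(documents)
```

A query is answered by intersecting a few attribute lists, so the documents themselves are never scanned.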

The technique of indexing induces a partition on a document collection which is very fine in nature. The entire data base is divided into as many groups as there are attributes. Classification, on the other hand, obtains a coarser structure. The entire file of documents is broken up into a much smaller number of groups, where each group contains documents whose attributes are similar to each other. More formally, the basic document classification problem consists of categorizing a set of documents by assigning them to a reasonably small number of subpopulations in such a way that the members within each group are sufficiently alike to justify ignoring the individual differences between them.

Most content analysis methods today, including classification of large information systems, are manual in nature. In practice this has proved to be very time consuming and expensive to operate. Therefore it is desirable to automate the entire process of content analysis. This research concentrates on devising an efficient automatic technique for the classification aspect of content analysis.

In general, automatic document classification requires the following sets of operations.

(i) A set of categories or classes, which form a desired partition of the entire document base, is chosen.

(ii) A set of keywords which are most representative of the documents in the collection is selected.

(iii) An algorithm is designed which, given the description of a document in terms of its keywords, will be able to assign that document to one or more of the predetermined set of classes.

This research assumes the presence of a predetermined set of categories and keywords. It concentrates mainly on the third point, viz., on the design of efficient methods for assigning a document to one or more categories based on its description.



Automatic document classification has been an area of research for a long time. One of the earliest and most significant contributions was that of Luhn [18], who showed that statistical analysis of the words in a document can provide some clues as to its content. This led to the development of several automatic classification methods from the early 1960's to the early 1970's, a number of which will be discussed in some detail in Chapter II. The basic approach used by all these methods is such that each document must be read entirely before it can be assigned to some category.

This research assumes a different approach. The basic philosophy employed here is based on the premise that it may not be necessary to examine a document entirely before some indication of its subject content can be obtained. If documents can be suitably classified by examining only limited portions thereof, then considerable time and money spent on processing entire documents can be saved. This realisation has led to the development and implementation of a basic sequential technique for automatic document classification. In this method keywords are extracted sequentially from a document, and at each stage a statistical prediction technique is used to determine whether or not the document can be classified. If not, then more of it is read. Details of the basic sequential method, its feasibility and its limitations are discussed in Chapter III. Specifically, it is pointed out that the basic sequential method is insensitive to the occurrence of 'noisy' or inappropriate keywords and lacks the capability of systematically assigning a document to more than one category.

Chapter IV addresses itself to these two problems and discusses the need for a measure which will be able to identify clusters of similar keywords. Such a measure, called the Bayesian distance, is then defined on a vector space representation of keywords. Chapter V explores the possibility of using the Bayesian distance for the purposes of automatic document classification. It is shown that by studying the variation of its magnitude and direction as keywords are read from a document, noisy words may be isolated and clusters of similar keywords can be identified. These clusters of similar keywords are such that they relate the document to different classes.

Chapter VI uses the properties of the Bayesian distance to design a classification technique which is called the revised sequential algorithm. This algorithm operates by first identifying two groups of keywords, the good keyword set and the noisy keyword set. It then analyzes the group of good keywords to obtain a primary class for a document. Chapter VII deals with the design of a method which uses the Bayesian distance measure to analyse the keywords contained in the noisy group. If these keywords are such that they indicate an additional class to which the document may be assigned, then this class is denoted as the secondary class.



Chapter VIII presents a summary of the basic achievements of this research and identifies several related areas in which further research could be pursued.
In the last chapter the problem of automatic document classification was introduced as one aspect of the larger problem of automatic content analysis. Automatic document classification was defined as the process of categorising a set of documents by assigning them to a reasonably small number of subpopulations or groups. The number of such groups obtained is dictated by the requirements demanded of the designer of the classification system. A larger number will be needed when a very fine distinction is required between documents. A small number suffices when the distinction need only be coarser. A combination of coarse and fine distinction may be obtained by designing a hierarchical system and then operating at a suitable level of the hierarchy.

This research assumes that the number of subpopulations or groups into which a document collection is to be partitioned is available. Each of these groups, also referred to as a category, denotes a distinct subject area and together they describe the entire area of discourse of the total document collection. The problem then is to design methods which will automatically examine each document and, based on its content, assign it to one or more of these categories.
The classification accuracy can then be estimated by comparing the results with those obtained by using manual methods.

In order that the outcome of this research may be evaluated, it is necessary to take a critical look at some of the more well known existing automatic classification systems. This chapter discusses the advantages and limitations of these systems. Since each of the systems discussed uses its own definitions and concepts, comparison becomes difficult unless a standardised set of definitions is used. The following section attempts to do this briefly.

2.1 Basic Concepts and Definitions

Documents: For the purposes of this research a document is considered to be any item in the form of an abstract, an article or any other coherent body of text. A document data base will be denoted by the set D where

$$D = \{d_1, d_2, \ldots, d_n\}$$

represents a collection of n documents.

Categories: Given a set of documents D, the number of groups into which the set is to be divided by the classification process is first determined. Each such group will be referred to as a category. The set of categories will be denoted by $C = \{C_1, C_2, \ldots, C_t\}$. The documents contained within each category will be sufficiently alike in their subject content to justify ignoring the individual differences between them.
Keywords: It was pointed out in Chapter I that in order to achieve classification each document in the set D has to be analysed for its content and then represented by means of a set of content identifiers. These content identifiers, also referred to as attributes, are usually a set of distinguished words, phrases or sentences that together describe the complete area of discourse of the entire document collection. This research, like most other automatic document classification methods, uses a set of distinguished words to act as content identifiers. These distinguished words will be called keywords. Selection of an appropriate set of keywords is an area of research in its own right and has received a great deal of attention in the literature [22, 32]. This research assumes that such a set is available. It will be denoted by $K = \{k_1, k_2, \ldots, k_m\}$. Each keyword $k_i$ contained in K relates a given document to one or more of the categories present in the set C defined above. The exact way in which this relationship between a keyword and a category is achieved will be discussed later in this chapter.

Using a set of documents, a set of categories and a set of keywords as the starting point, the following sections in this chapter discuss a number of automatic classification methods that are currently in use, or have been used in the past. Based on the general philosophy used in these classification methods, they have been divided into two broad groups, viz., statistical classification techniques and classification techniques based on clustering methods.
2.2 Statistical Methods of Automatic Document Classification

The possibility of characterizing the subject matter of a document by means of automatic content analysis was recognised in the early 1950's, but it remained a relatively uncharted area until Luhn [18] showed that a statistical analysis of the words in a document would provide some clues as to its content. After this pioneering work several automatic document classification methods have been developed, starting from the early 1960's to the early 1970's. The basic approach of all these methods is the same. Each document to be classified into one of the given categories is read entirely and all the keywords present in it are extracted. Using these keywords a prediction function is used to relate the document to each of the categories. The differences in the various methods lie in the nature of the prediction function that is used and the way the relationship between a category and a document is computed. These differences and similarities will be clarified when the various methods are discussed in some detail in the following sections.

2.2.1 Bayesian Techniques

Maron [19] applied a statistical method to the problem of automatic classification which involved:

(i) the determination of certain probability relationships between individual keywords and subject categories, and

(ii) the use of these relationships to predict the category to which a document belongs by using Bayes rule.
The prediction method that he used was as follows. Given a set of categories $C = \{C_1, C_2, \ldots, C_t\}$ and a document which contains only one clue word $k_i$, the probability that the document belongs to each of the categories is computed using

$$P(C_j / k_i) = \frac{P(C_j)\, P(k_i / C_j)}{\sum_{l=1}^{t} P(C_l)\, P(k_i / C_l)} \qquad (2.1)$$

Extension to the case of documents containing more than one clue word was made assuming keyword independence.

^ /_ ^ " • ^ ]' ^ ■ : ■■■ -^^ ■ 

For experimentation purposes Maron chose 405 abstracts of computer literature published in the IRE Transactions on Electronic Computers, March 1959. He selected 32 categories manually, and used 260 of the 405 abstracts as sample documents from which keyword frequency statistics were obtained. Using a set of 90 keywords, he achieved a classification accuracy of about 50% over the entire set of documents.

The classification techniques of this research bear a resemblance to Maron's method, in that the keyword class frequencies are calculated using sample documents and the a posteriori probabilities of the classes after a number of keywords are read are computed using Bayes rule. The basic difference lies in that this research hypothesizes that the keywords present in a document need not be examined in toto before class membership can be determined, but need only be read sequentially until a decision is reached. The philosophy behind such a sequential technique will be explicated in greater detail in Chapter III.
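A minimal sketch of a Bayes rule prediction under the keyword independence assumption may help fix ideas. It is not Maron's program: the function name `posterior`, the two categories and the probability values are all hypothetical, and the conditional probabilities are simply given rather than estimated from sample documents.

```python
def posterior(priors, likelihoods, keywords):
    """A posteriori class probabilities after reading `keywords`.

    priors[j]          -- P(C_j)
    likelihoods[k][j]  -- P(k / C_j), assumed already estimated from sample documents
    Keywords are treated as independent given the category.
    """
    scores = list(priors)
    for k in keywords:
        scores = [s * likelihoods[k][j] for j, s in enumerate(scores)]
    total = sum(scores)
    if total == 0:
        raise ValueError("keywords are impossible under every category")
    return [s / total for s in scores]


# Hypothetical values for two categories and two clue words.
priors = [0.5, 0.5]
likelihoods = {"circuit": [0.8, 0.2], "compiler": [0.1, 0.6]}

one = posterior(priors, likelihoods, ["circuit"])
two = posterior(priors, likelihoods, ["circuit", "compiler"])
```

With these numbers the first clue word favours the first category, and the second clue word shifts the balance to the second category.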
2.2.2 Techniques Based on Matrix Manipulation

Subsequent to the publication of Maron's work, Borko developed a technique for automatic document classification based on factor analysis using Maron's 405 abstracts of computer literature. By means of a computer program, counts were made of the number of times each of the 90 keywords occurred in each of the documents in a 260 document sample set. Using this data a 90 x 90 keyword correlation matrix was derived. This matrix was then factor analysed; as a result, a set of subject categories were identified. A prediction technique based on the keyword frequencies in the documents and their factor loadings was developed. Suppose $T_i$ denotes the number of occurrences of keyword $k_i$ in a document and $L_{ij}$ denotes the normalized factor loading of keyword $k_i$ for category $C_j$. Then a value $P_j$ is calculated for each category $C_j$ as follows:

$$P_j = L_{1j} T_1 + L_{2j} T_2 + \cdots + L_{mj} T_m \qquad (2.2)$$

The document is classified in the class having the highest value of $P_j$. Using this technique a classification accuracy of about 48% was obtained.
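The prediction step of this factor-loading technique is a weighted sum per category followed by taking the largest value. The sketch below illustrates only that step; the loadings and counts are made-up numbers, and the factor analysis that would produce the loadings is not shown.

```python
def factor_scores(loadings, counts):
    """Score each category j as the sum over keywords i of loadings[i][j] * counts[i].

    loadings[i][j] -- normalized factor loading of keyword i for category j
    counts[i]      -- number of occurrences of keyword i in the document
    """
    n_categories = len(loadings[0])
    return [
        sum(loadings[i][j] * counts[i] for i in range(len(counts)))
        for j in range(n_categories)
    ]


def classify(loadings, counts):
    """Return the index of the category with the highest score."""
    scores = factor_scores(loadings, counts)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical loadings for 3 keywords and 2 categories.
loadings = [[0.9, 0.1], [0.2, 0.7], [0.0, 0.5]]
counts = [1, 2, 1]
scores = factor_scores(loadings, counts)
best = classify(loadings, counts)
```

Here the second category wins because the second keyword occurs twice and loads heavily on it.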

At about the same time as Borko, Williams [29, 30] devised a classification technique based on discriminant analysis. Instead of computing factor loadings, he computed a discriminant coefficient for each keyword and category. Using a set of 420 solid state abstracts published by the Cambridge Communications Corporation and a set of 48 keywords, he first classified 120 documents manually into three categories. The frequency of occurrence of a keyword in each document belonging to a given category is then empirically obtained. Using these frequencies two m x m matrices are computed as follows:

W: a matrix whose elements are the pooled within-category sum of squares

A: a matrix whose elements are the among-category sum of squares

Suppose now a document contains keywords $k_1, k_2, \ldots, k_m$ whose mean frequencies are given by $x_1, x_2, \ldots, x_m$; then a prediction function of the following form is used:

$$Y = B_1 x_1 + B_2 x_2 + \cdots + B_m x_m \qquad (2.3)$$

where the coefficients $B_1, B_2, \ldots, B_m$ are obtained by solving the equation

$$(W^{-1} A - \lambda I)\, B = 0 \qquad (2.4)$$

One of the eigenvalues $\lambda$ is chosen to give the best discrimination between the categories. The eigenvector corresponding to this $\lambda$ then provides the set of coefficients $B_1, B_2, \ldots, B_m$. The discriminant score obtained from equation (2.3) is then the basis of assigning a test document to one of the categories. This method achieved a classification accuracy of about 75% on the set of 420 solid state abstracts.

Both the methods discussed in this section suffer from the same disadvantage. They require the inversion of matrices to compute the coefficients which relate a keyword to a category. For a sizable set of keywords, say 500, these methods become impracticable. More specifically, the increase in storage and time required by these methods is proportional to the square of the number of keywords.

It is clear from this discussion that methods which are dependent on inverting matrices whose dimensions depend on the size of the keyword set are impractical. The next section discusses a few methods of automatic document classification which essentially deal with matrices whose sizes depend on the number of documents in the collection. These are generally known as clustering techniques.

2.3 Automatic Document Classification Based on Clustering Techniques

The theory of clustering deals with the problem of finding natural groupings in a set of data. These natural groupings are obtained based on the similarity in the attributes of the data elements. An excellent description of various clustering techniques that have evolved over the years can be found in the book by Sokal and Sneath [24]. In this section we will discuss the basic philosophy of clustering and show how it has been applied for use in automatic document classification.

The starting point for most clustering algorithms is the similarity matrix. Let $D = \{d_1, d_2, \ldots, d_n\}$ be the set of documents and $K = \{k_1, k_2, \ldots, k_m\}$ be the set of keywords under consideration. Each document $d_i$ in D is read entirely and all keywords occurring in it are extracted. An n x n matrix Q is obtained, such that

$$Q(i,j) = Q(j,i) = \text{number of keywords occurring in common between documents } d_i \text{ and } d_j.$$

The similarity matrix S can now be directly obtained from the matrix Q by normalizing each of its elements by using a standard similarity measure. There are several such measures that have been used in the literature and the one to be used for a particular implementation depends on the user requirements. One that has been used very widely is the Tanimoto similarity measure [25], which is as follows. Let S be of dimension n x n. Then each element S(i,j) is calculated as follows:

$$S(i,j) = \frac{Q(i,j)}{Q(i,i) + Q(j,j) - Q(i,j)} \qquad (2.5)$$

As can be seen, the similarity matrix is symmetric in nature, the diagonal entries being all equal to unity. Therefore during use in clustering only the upper or the lower triangular portion of the matrix needs to be stored.
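The construction of the two matrices can be sketched as follows, assuming the usual form of the Tanimoto measure (shared keywords divided by the keywords present in either document). The function names and the three sample documents are hypothetical, and the full square matrix is stored for clarity rather than only one triangle.

```python
def common_keyword_matrix(doc_keywords):
    """Q(i, j) = number of keywords occurring in common between documents i and j."""
    sets = [set(kw) for kw in doc_keywords]
    n = len(sets)
    return [[len(sets[i] & sets[j]) for j in range(n)] for i in range(n)]


def tanimoto_matrix(q):
    """S(i, j) = Q(i, j) / (Q(i, i) + Q(j, j) - Q(i, j)).

    A pair of documents with no keywords at all gets similarity 0.0
    to avoid dividing by zero (an assumption of this sketch).
    """
    n = len(q)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = q[i][i] + q[j][j] - q[i][j]
            s[i][j] = q[i][j] / denom if denom else 0.0
    return s


# Hypothetical keyword lists for three documents.
docs = [["a", "b", "c"], ["b", "c", "d"], ["e"]]
Q = common_keyword_matrix(docs)
S = tanimoto_matrix(Q)
```

The first two documents share two of their four distinct keywords, so their similarity is one half, while the third document is unrelated to both.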

Using this similarity matrix a set of clusters is first obtained. This is usually done by defining a threshold parameter T such that two documents $d_i$ and $d_j$ are assigned to the same cluster if S(i,j) exceeds the value of T. It is clear that as T is increased two documents should have a greater number of keywords in common to be placed into the same cluster.

Several well known methods follow this general procedure. Their basic philosophy is to represent each document as the vertex of an undirected graph and then find connected components in this graph. The next section examines them briefly.
2.3.1 Graph Theoretic Document Clustering Methods

In order that this section may be clearly understood some elementary concepts in graph theory need to be defined.

Graph: A graph G = (V, E) consists of a set of vertices $V = \{v_1, v_2, \ldots\}$ and a set of edges $E = \{e_1, e_2, \ldots\}$ such that each edge $e_k$ is identified with an unordered pair $(v_i, v_j)$ of vertices. The vertices $v_i$, $v_j$ associated with edge $e_k$ are called the end vertices of $e_k$. The edge $e_k$ is said to be incident to $v_i$ and $v_j$.

Subgraph: A graph g is said to be a subgraph of graph G if all the vertices and all the edges of g are in G, and each edge of g has the same end vertices in g as in G.

Path: A path is a finite alternating sequence of vertices and edges $v_1, e_1, v_2, e_2, \ldots, e_{n-1}, v_n$ such that

i) no edge or vertex appears more than once in the sequence, and
ii) each edge is incident with the vertices preceding and following it in the sequence.

Connected Graph: A graph G is said to be connected if there is at least one path between every pair of vertices in G.

Complete Graph: A graph in which there exists an edge between every pair of vertices is called a complete graph.

Each document is represented by a vertex and there is an edge between two vertices $d_i$, $d_j$ if S(i,j) exceeds a threshold value T. An example is shown in Figure 2.1.



Figure 2.1 Similarity Matrix and Its Associated Graph



Using this representation Bonner [1] obtains all the complete subgraphs of the graph. Each of these subgraphs represents a set of documents which belong to one category. Unfortunately, some graphs have an exponential number of such subgraphs, and in that case such a method becomes much too time consuming to be practical.
A variant of this method is to obtain the connected subgraphs instead of the complete subgraphs. This technique is used by Van Rijsbergen [26]. This is easier to do but it results in clusters where the documents within a cluster are not as closely related as in Bonner's method.
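The connected-subgraph variant can be illustrated with a short depth-first search over the threshold graph. This is a generic sketch rather than any published implementation; the function name `connected_clusters`, the 4 x 4 similarity matrix and the threshold values are assumptions chosen for the example.

```python
def connected_clusters(similarity, threshold):
    """Group documents into the connected components of the threshold graph.

    An edge joins documents i and j when similarity[i][j] exceeds `threshold`.
    """
    n = len(similarity)
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if not seen[j] and j != i and similarity[i][j] > threshold:
                    seen[j] = True
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters


# Hypothetical similarity matrix for four documents.
S = [
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.5, 0.0],
    [0.1, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
loose = connected_clusters(S, 0.4)
tight = connected_clusters(S, 0.55)
```

At the lower threshold documents 0 and 2 fall in one cluster although their direct similarity is only 0.1, because both are linked to document 1. A complete-subgraph method would not group them, which is the looseness noted above.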

The methods discussed in this section not only require the computation of a similarity matrix but also a graph connectivity matrix. As larger and larger document collections are considered the storage space and computation time required increase as the square of the number of documents in the collection. Even for a 1,000 document collection, the storage and time required for these methods may become prohibitive. Some methods achieve a savings in storage space and computation time by eliminating the need for a graph-theoretic representation of the similarity matrix. Instead they use the method of centroids to assign documents into clusters. The feasibility of using such a technique has been investigated by various researchers and the following section briefly discusses some of the approaches taken.

2.3.2 Document Clustering by Finding Centroids

In this technique every document is examined serially and some documents are chosen to be centroids of the clusters if they satisfy certain criteria. For example, Rocchio's method [21] determines whether a document should be the centroid of a cluster by testing if there are sufficiently many documents located in its proximity. Specifically, a document is a possible centroid only if there are at least $n_1$ documents which have a similarity of at least $s_1$, and $n_2$ documents which have a similarity of at least $s_2$ with the given document, where $n_1$, $s_1$, $n_2$, $s_2$ are parameters which can be varied to obtain different results.

Dattola's method [5] further simplifies this process by dividing the total document collection into an arbitrary number of groups. Each group is represented by a centroid vector whose dimension is equal to the number of keywords in the system. Each element of this vector is a quantity proportional to the frequency of occurrence of a keyword in all documents of the group. Every document in the collection, each of which is also represented by such a vector, is now compared with each centroid and reassigned to the group to which it is the closest. The process is repeated until identical sets of clusters are obtained in successive iterations.
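The iterative reassignment just described can be sketched as below. This is not Dattola's algorithm as published: closeness is measured here by squared Euclidean distance and the centroid is taken as the component-wise mean, both of which are assumptions of the sketch, as are the function names and the sample vectors.

```python
def centroid(vectors):
    """Component-wise mean keyword frequency of a group of document vectors."""
    m = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(m)]


def sq_distance(a, b):
    """Squared Euclidean distance, used here as the (assumed) closeness measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recluster(vectors, assignment, max_iter=100):
    """Reassign each document to its nearest centroid until the groups stop changing."""
    for _ in range(max_iter):
        groups = {}
        for idx, g in enumerate(assignment):
            groups.setdefault(g, []).append(vectors[idx])
        centroids = {g: centroid(vs) for g, vs in groups.items()}
        new_assignment = [
            min(sorted(centroids), key=lambda g: sq_distance(v, centroids[g]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment


# Hypothetical keyword-frequency vectors over a two-keyword vocabulary.
vectors = [[2, 0], [3, 0], [0, 2], [0, 3]]
final = recluster(vectors, [0, 1, 1, 1])      # arbitrary initial grouping
stuck = recluster(vectors, [0, 1, 0, 1])      # a different arbitrary start
```

From the first starting point the procedure separates the two obvious groups. From the second it stops immediately with the mixed grouping unchanged, a small illustration of how much the outcome can depend on the arbitrary initial groups.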

Both of the above techniques have been implemented in the classification phase of the SMART information storage and retrieval system [23]. They have been used to classify a subset of documents from the Cranfield collection [4]. Results have indicated that their classification is very dependent on the order in which the documents are examined. Dattola's method, even though it is faster than that of Rocchio, was found to be extremely dependent on the initial set of clusters that are chosen arbitrarily.

A major disadvantage of both these techniques is that the operations must be performed on the entire document collection to obtain satisfactory classification results. Therefore, for a collection whose size may change dynamically, these methods are generally not applicable.

Recently Yu [31] has proposed a technique which alleviates this problem to a certain extent. It is a very novel approach because instead of performing the clustering on a set of documents, it is performed on a set of user queries. First a representative set of queries is obtained. These keyword rich queries are then clustered using a standard technique. Each of these query clusters is now considered separately and every document relevant to all the queries in a given cluster is retrieved and placed in one group. Thus clusters in the query space are made to induce clusters in the document space.

This technique works well when a data base has associated with it a representative query set. However, for a document collection for which no such set exists, it would need considerable modification.

2.4 Limitations of Present Classification Methods

The techniques of document classification discussed in the previous sections are really representative of a wide spectrum of contemporary methods. They are representative in that they incorporate the same basic philosophy underlying the other methods. Each has certain advantages and disadvantages which have been discussed earlier. In this section several major philosophical points related to these techniques will be noted and it will be outlined how the classification algorithms developed in this research address themselves to these points.

The following observations can be made about the classification techniques that we have discussed.

(i) Techniques which use a similarity matrix are not practical for large data bases using existing facilities because of the storage space (both core and secondary storage) required to store such a matrix and the time required to process it.

(ii) Another disadvantage of using a similarity matrix for classification is that the entire document set has to be available before the classification process can be initiated. In other words, the document collection must be static in nature.

(iii) Most of the methods that we have discussed read each document of a collection completely before attempting classification. Given a keyword set $\{k_1, k_2, \ldots, k_m\}$, each document prior to classification is represented by an m-dimensional vector which depicts the presence or absence of each of the m keywords from the set.

(iv) Most of the methods discussed work with a small keyword set. Each keyword in this set is related to a particular category in a binary fashion, i.e., it either belongs to a category or it does not. Given a keyword set $K = \{k_1, k_2, \ldots, k_m\}$ and a category set $C = \{C_1, C_2, \ldots, C_t\}$, the set K is divided into t groups of keywords $G = \{g_1, g_2, \ldots, g_t\}$. Each group $g_j$ contains keywords which are indicative of a category $C_j$. If a keyword $k_i$ is present in two groups $g_j$ and $g_l$, then it is considered to be equally indicative of classes $C_j$ and $C_l$.

The algorithms developed in this research are based on the
philosophy that an entire document need not be examined before a
decision regarding its class membership is made. As each
keyword contained in a document is examined, information regarding
its class membership is obtained. After a certain portion of the
document has been read any new keyword extracted from it may give
only marginal information about its class membership. When this
happens, it may be more efficient to stop reading the document any
further and classify it into a category which is most appropriate
at that point. In other words, the classification algorithms
developed in this research process the words sequentially in a
succession of stages. Each stage involves reading the document and
extracting a fixed number of keywords. If at any stage a decision
can be made about which class should be assigned to the document,
then the process stops; otherwise it continues to the next stage.
The motivation behind such a sequential technique is provided in
the early section of Chapter III.

Another major difference between the classification algorithms
that have been discussed in this chapter and those developed for
this research is in the representation of keywords. Instead of assuming
that a keyword is either indicative of a category or it is not,
each keyword is represented by a set of values which essentially
indicate different degrees to which a keyword is related to a given
category. More specifically, each keyword $k_i$ is represented by t
probability values $\{p_{i1}, p_{i2}, \ldots, p_{it}\}$, where $p_{ij}$
represents the probability $P(k_i/C_j)$, i.e., the probability that a
document which belongs to category $C_j$ will contain keyword $k_i$.
Chapter III indicates a method by which these probabilities may be
calculated.

Such a representation of keywords has led to the development
of a method which treats a keyword as a t-dimensional probability
vector. A document therefore can be represented as a cluster of
points in a t-dimensional vector space. Chapter IV elaborates on
this concept and shows how clusters of keywords which relate the
document to different categories may be isolated. It is also
shown how certain keywords which mislead the classification algorithm
into placing a document in a wrong category can be isolated as
'noisy' keywords.



CHAPTER III



THE SEQUENTIAL CLASSIFICATION METHOD



As pointed out in section 2.5 of the previous chapter, most
classification algorithms in use today extract all the concepts or
attributes present in a document before initiating classification.
All the concepts present in a document need not always be considered
before its subject content can be determined. This observation
forms the basis of the sequential classification method that will
be discussed in this chapter.



3.1 Sequential Decision Model for Pattern Classification

The problem of automatic document classification can be likened
to the classical problem of statistical pattern recognition where
there exists a set of categories, and each category is characterized
by a set of attribute-value pairs. These characteristic attribute-value
pairs are obtained from a set of training patterns each of
which is known to belong to a certain category. When a test pattern
arrives, measurements are made on the attributes or features
contained in this test pattern, and based upon these measurements
and a decision criterion, it is classified into one or more of the
existing categories. For the problem of document classification,
the categories are the various subject classes which describe the
area of discourse of a collection of documents. The attributes are
keywords present in the document collection and the training patterns
are preclassified sample documents from which characteristic keywords
are obtained for the various classes.

With this analogy in mind we can now formulate a basis for
sequential document classification techniques based on the theory
of sequential pattern recognition of Wald [27] and Fu [12]. Suppose
we have a set of categories $C = \{C_1, C_2, \ldots, C_t\}$ and a set of
features $K = \{k_1, k_2, \ldots, k_N\}$ describing these categories. In
non-sequential classification theory all the N features present in
a test pattern are observed by the classifier at one stage. This
might prove to be impractical if the cost of making feature
measurements is taken into consideration. Instead only as many features
may be measured as are needed to classify the pattern with a given
classification accuracy. Besides, after a certain point more feature
measurements will not necessarily increase classification accuracy.
A trade-off between the error of misclassification and the number of
features to be measured can be obtained by taking feature measurements
sequentially and terminating the sequential process when a sufficient
or desirable accuracy is obtained.

If there are two pattern classes $C_1$ and $C_2$ to be recognized,
then at the nth stage of the sequential process, that is, after the
nth feature measurement is made, the classifier computes the
sequential probability ratio given by equation (3.1),

$$\lambda_n = \frac{P_n(X/C_1)}{P_n(X/C_2)} \tag{3.1}$$

where $P_n(X/C_i)$, $i = 1, 2$, represents the multivariate conditional
probability density function $P_n(X_1, X_2, \ldots, X_n/C_i)$, and
$X_1, X_2, \ldots, X_n$ are the features that have been measured so far,
that is, when the end of the nth stage is reached. The $\lambda_n$ computed
by equation (3.1) is then compared with two stopping boundaries A
and B. If $\lambda_n \geq A$ then the decision is $X \sim C_1$, and if
$\lambda_n \leq B$ then the decision is $X \sim C_2$. If $B < \lambda_n < A$,
then an additional feature measurement is made and the process proceeds
to the (n+1)th stage. The two stopping boundaries are related to the
misclassification error probabilities by equations (3.2) and (3.3),

$$A = \frac{1 - e_{21}}{e_{12}} \tag{3.2}$$

$$B = \frac{e_{21}}{1 - e_{12}} \tag{3.3}$$

where $e_{ij}$ is the probability of deciding $X \sim C_i$ when actually
$X \sim C_j$ is true. It has been shown by Wald [27] that the sequential
probability ratio test (SPRT) is optimal in the case of two pattern
classes, that is, for a given $e_{12}$ and $e_{21}$ there is no other
procedure yielding a smaller average number of feature measurements
than SPRT. For more than two pattern classes, say $C_1, C_2, \ldots, C_t$,
a generalized sequential probability ratio test (GSPRT) can be devised.
This is an extension of the two class case, but the optimality property
pointed out above no longer holds. Fu [12], however, notes that in most
cases it can be shown to be close to optimal.
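
The two-class test described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this research: it assumes the stopping boundaries take the standard Wald form, it treats successive feature measurements as independent, and it accumulates the ratio in logarithms. The names `sprt_boundaries` and `sprt` and the default error probabilities are hypothetical.

```python
import math


def sprt_boundaries(e12, e21):
    """Stopping boundaries A and B in the standard Wald form (assumed).

    e12: probability of deciding C1 when C2 is true.
    e21: probability of deciding C2 when C1 is true.
    """
    upper = (1.0 - e21) / e12
    lower = e21 / (1.0 - e12)
    return upper, lower


def sprt(likelihood_pairs, e12=0.05, e21=0.05):
    """Run the two-class test on successive feature measurements.

    likelihood_pairs yields (P(x/C1), P(x/C2)) for each new feature; the
    features are treated as independent (an assumption of this sketch).
    Returns (decision, number_of_features_used).
    """
    upper, lower = sprt_boundaries(e12, e21)
    log_upper, log_lower = math.log(upper), math.log(lower)
    log_ratio = 0.0
    used = 0
    for used, (p1, p2) in enumerate(likelihood_pairs, start=1):
        log_ratio += math.log(p1) - math.log(p2)
        if log_ratio >= log_upper:
            return "C1", used
        if log_ratio <= log_lower:
            return "C2", used
    # The measurements ran out before either boundary was crossed.
    return "undecided", used
```

With both error probabilities set to 0.05 the upper boundary is about 19, so a feature that is four times more likely under one class than the other has to be observed three times before a decision is reached.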
Once again at the nth stage the following statistic is computed:

$$U_n(X/C_i) = \frac{P_n(X/C_i)}{\left[\prod_{q=1}^{t} P_n(X/C_q)\right]^{1/t}}, \quad i = 1, 2, \ldots, t \tag{3.4}$$

$U_n(X/C_i)$ is then compared with the stopping boundary of the ith
pattern class $C_i$ and the decision is to reject the class $C_i$
from further consideration if

$$U_n(X/C_i) < A(C_i), \quad i = 1, 2, \ldots, t,$$

where $A(C_i)$ is the stopping boundary for class $C_i$. At each stage one
or more pattern classes may be rejected from consideration and a new
set of generalized sequential probability ratios may be computed.
Pattern classes are rejected sequentially until only one is left
which is accepted as the recognized class. In some practical cases,
however, the number of feature measurements may become extremely
large and it may become necessary to truncate the sequential process
at a predetermined stage.
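
The rejection procedure of the generalized test can be sketched as follows. The sketch assumes that the statistic for a class is its likelihood divided by the geometric mean of the likelihoods of the classes still under consideration, that features are independent, and that all classes share one stopping boundary; the function name `gsprt` and its arguments are hypothetical.

```python
import math


def gsprt(stage_log_probs, classes, log_boundary):
    """Reject classes stage by stage until one remains.

    stage_log_probs: one dict per stage mapping class -> log P(x_n/C_i)
    for the newly measured feature (independence is assumed).
    log_boundary: log A(C_i), taken equal for all classes in this sketch.
    Returns (remaining_classes, stages_used).
    """
    active = list(classes)
    totals = {c: 0.0 for c in active}
    stage = 0
    for stage, row in enumerate(stage_log_probs, start=1):
        for c in active:
            totals[c] += row[c]
        # Log of the geometric mean of the likelihoods of the active classes.
        mean_log = sum(totals[c] for c in active) / len(active)
        survivors = [c for c in active if totals[c] - mean_log >= log_boundary]
        if survivors:  # never discard every class at once
            active = survivors
        if len(active) == 1:
            break
    return active, stage
```

Because the largest statistic is never smaller than one, a boundary below one guarantees that the most likely class always survives a stage.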

3.2 Sequential Classification Technique for Documents

As pointed out in the previous section the problem of automatic
document classification is very similar to the classical pattern
recognition problem. Given a document collection a number of different
subject areas describing the total area of discourse of the collection
is identified. Each of these subject areas denotes a different
category. A test document, like a test pattern, is now examined to
determine the category to which it belongs.

There are, however, several aspects in which the two applications
differ. The problem of pattern recognition can usually be
formulated by using rigorous mathematical techniques because the
probability density functions of the features are usually known
or can be estimated. In the case of document classification such
rigor is lacking. The features utilized in the case of pattern
recognition now take the form of keywords. Since keywords represent
ideas and not numerical quantities, the decision as to which keywords
best represent a category is always subjective. Also in most cases,
exact probability distributions of keywords over the categories they
describe cannot be obtained. This section describes a sequential
document classification procedure which is based on the GSPRT outlined
in the previous section.

As in the case of pattern recognition we have the following
predetermined items to work with:

category set: $\{C_1, C_2, \ldots, C_t\}$. Each category $C_j$ denotes a
subject area and taken together, all these subject areas describe the
complete area of discourse of the document collection. For instance,
if one is dealing with a set of scientific documents, the subject areas
might be Mathematics, Physics, Chemistry, Biology, etc.

keyword set: $\{k_1, k_2, \ldots, k_m\}$. These correspond to the feature
set of a pattern recognition problem. Each keyword present in a
document relates it to one or more of the subject areas chosen
above.




keyword-category probability matrix: In a sequential pattern
recognition problem, after a feature X is measured, the computation
of the sequential probability ratio test requires the availability
of the conditional probability $P(X/C_j)$.
In the context of document classification, these are the probability
values $P(k_1/C_j), \ldots, P(k_m/C_j)$, where now the features are
represented by keywords. Instead of being probability densities, these are
discrete probability values calculated from sample documents. Thus
the matrix CP shown in Figure 3.1 is required to compute the
probability ratios at each stage, where $P(k_i/C_j)$ is the probability of
keyword $k_i$ being present in a document which belongs to category $C_j$.

Figure 3.1 Conditional Probability Matrix

The method used to calculate this conditional probability matrix will
be described later in the chapter. Classification of a document is
then carried out as follows:

The document is read until one or more keywords are found.
At each stage, i.e., after reading each keyword, one of the
following three decisions is made:

(i) the document is classified into one of the t
categories,
(ii) the document is termed unclassifiable, or
(iii) more of the document is read because neither
of decisions (i) or (ii) is taken.

Let us turn to a consideration of the probability ratio test
and the prediction methods to be used. Suppose we are at the nth
stage of the sequential process, i.e., n keywords have been read
and they are $W_1, W_2, \ldots, W_n$, where each $W_i$ is a member of the set
$K = \{k_1, k_2, \ldots, k_m\}$. The probability $P(W_1, W_2, \ldots, W_n/C_j)$ that
the sequence of words $W_1, W_2, \ldots, W_n$ will be in a document that
belongs to category $C_j$ has to be calculated. This will be computed
by using the probability estimate $P(W_i/C_j)$ obtained from the matrix
CP discussed earlier. Since here we are estimating the probability
of finding a string of keywords in a document which belongs to a
given category, an issue that should be discussed is whether or not
the keywords should be considered to be dependent on or independent of
each other. The next section briefly addresses this problem.



The calculation of $P(W_1, W_2, \ldots, W_n/C_j)$ can be based on the
premise that the keywords are dependent on each other. A prediction
formula which takes into account the dependence of keywords is
presented in equation (3.6) below. It should be noted here that this
is just one of the many ways in which the effect of keyword dependence
may be captured. Using conditional probabilities and noting that

$$P(A, B/C) = P(A/C) \cdot P(B/A, C) \tag{3.5}$$

we can say

$$P(W_1, W_2, \ldots, W_n/C_j) = P(W_1/C_j) \cdot P(W_2/W_1, C_j) \cdots P(W_n/W_{n-1}, \ldots, W_1, C_j) \tag{3.6}$$

Let us isolate any one term from the right hand side of
equation (3.6), say $P(W_i/W_{i-1}, \ldots, W_1, C_j)$. This is the probability
that a document which belongs to category $C_j$ and contains the keywords
$W_1, \ldots, W_{i-1}$ will also contain keyword $W_i$. Since each of
$W_1, W_2, \ldots, W_i$ can be any keyword in the set K defined before, this
probability has to be calculated for $m^i$ different cases, where m is
the number of keywords in set K. Considering the complexity of
this task, a simplifying assumption is necessary. Fried [11], in
presenting his classification model, simplified equation (3.6) above
by assuming that a keyword is independent of all others that precede
it in the document. As we shall see later, the sequential probability
ratio test does not depend on the assumptions of exactly how the
calculation of $P(W_1, W_2, \ldots, W_n/C_j)$ is made. Hence if the
classification algorithm performs well under the assumption of keyword
independence, which is a worst case assumption, it should do at least
as well when keyword dependence is considered. Thus using the
assumption that keywords are independent of each other we have

$$P(W_1, W_2, \ldots, W_n/C_j) = \prod_{i=1}^{n} P(W_i/C_j) \tag{3.7}$$
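
Under the independence assumption the joint probability of a keyword string is a plain product, which is conveniently accumulated as a sum of logarithms. The following sketch shows this; the table `CP`, its two keywords, and its probability values are hypothetical illustration data, and the functions `log_joint` and `joint` are not taken from the implementation described in this chapter.

```python
import math

# Hypothetical conditional probabilities P(k_i/C_j), indexed cp[keyword][category].
CP = {
    "laser": {"Physics": 0.20, "Chemistry": 0.05},
    "acid": {"Physics": 0.02, "Chemistry": 0.30},
}


def log_joint(keywords, category, cp):
    """log P(W_1,...,W_n / C_j) as the sum of log P(W_i / C_j)."""
    return sum(math.log(cp[w][category]) for w in keywords)


def joint(keywords, category, cp):
    """P(W_1,...,W_n / C_j) under the keyword independence assumption."""
    return math.exp(log_joint(keywords, category, cp))
```

Each new keyword costs one table lookup and one addition per class, whereas a dependence model would need a separate estimate for every possible preceding keyword string.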

3.2.2 Probability Ratio Test and Stopping Boundary

After the n keywords $W_1, W_2, \ldots, W_n$ have been read and
$P(W_1, W_2, \ldots, W_n/C_j)$ for $j = 1, 2, \ldots, t$ has been calculated at
the end of the nth stage of the sequential process, a probability ratio
test has to be performed to predict the class to which the document
may be assigned. For each j and for each fixed n, let
$P^*_n(W_1, \ldots, W_n/C_j)$ be a probability distribution over the sequence
$W_1, W_2, \ldots, W_n$. Then for each j this is a discrete distribution over
$m^n$ points, where m is the total number of keywords in set K. A
sequential decision rule can then be formulated as in GSPRT, by
computing the ratio:

$$\alpha_j = \frac{P(W_1, W_2, \ldots, W_n/C_j)}{P^*_n(W_1, W_2, \ldots, W_n/C_j)} \tag{3.8}$$

Fried and his coworkers [11] addressed themselves to this
problem and formulated the following decision rule.

At the nth stage of the sequential process the ratio $\alpha_j$ is
computed for each value of j ranging from 1 to t. This $\alpha_j$ is then
compared with a predetermined threshold value $\alpha$. As long as more than
one $\alpha_j$ is greater than $\alpha$, the process moves on to the (n+1)th
stage. If at any stage only one $\alpha_j$ is greater than $\alpha$ and all
the others are less than $\alpha$, then the document is
classified in class $C_j$ for which $\alpha_j$ is greater than $\alpha$.

Fried then studied various forms of $P^*_n$ that could be used
to compute the probability ratio. The following forms of $P^*_n$ were
investigated:

. (i) w^rt ^ , -v, ■ V ^ ^ ■ - 

. ; . . .... ■ . m ■ . , / :■ ^; , ,■ •. • 

Xm - Cnirn)! 

• •■ •■ n m! . ■ • ■> 

(lii) p|tw^.w2,..:iyc.) = >n(Wi.v...vVP^^i^ ■ • 

where $P(C_j)$ is the a priori probability of class $C_j$. $P^*_n$ given
by (i) and (ii) is independent of the keywords and depends only on n,
the stage of the sequential process. In their experiments they found
that (iii) gave the best results in terms of the number of stages that
are required by the sequential process before a document can be
classified.

In this research the sequential probability ratio is computed
by using Bayes formula for conditional probability. It can be
derived as follows.

Let the keywords that have been read at the nth stage of the
sequential process be $W_1, W_2, \ldots, W_n$, where each $W_i$ is a member of
the keyword set $K = \{k_1, k_2, \ldots, k_m\}$. Let $P(C_j)$ be the a priori
probability of a class $C_j$ before any keyword has been read. Then for
each class $C_j$ the probability $P(W_1, W_2, \ldots, W_n/C_j)$ can be computed
as before. Then the a posteriori probability of class $C_j$ after
$W_1, W_2, \ldots, W_n$ has occurred is given by

$$P(C_j/W_1, W_2, \ldots, W_n) = \frac{P(C_j) \cdot P(W_1, W_2, \ldots, W_n/C_j)}{\sum_{q=1}^{t} P(C_q) \cdot P(W_1, W_2, \ldots, W_n/C_q)} \tag{3.9}$$

This is the probability that a document containing keywords
$W_1, \ldots, W_n$ belongs to class $C_j$. For simplicity
$P(C_j/W_1, W_2, \ldots, W_n)$ will be denoted by $\alpha_j$. The value of
$\alpha_j$ is now compared with a preset threshold value, $\alpha$, as in the
Fried model. Based on this comparison one of the following two decisions
is made:

(i) if only one $\alpha_j \geq \alpha$ then the document is classified in the
corresponding class $C_j$;

(ii) if more than one $\alpha_j$'s are such that $\alpha_j \geq \alpha$, the
process moves on to the (n+1)th stage, i.e., one more keyword is
read; if the document does not contain any more keywords,
it is considered unclassifiable.

In case (ii), those classes for which the $\alpha_j$ values are less than the
threshold are deleted from consideration and only the ones whose
$\alpha_j$ values are greater than or equal to the threshold are retained.
This is done in the interest of reducing computation time as will be
pointed out in section 3.3 when implementation of the sequential
technique is discussed.
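
One stage of the Bayesian test can be sketched as below. This is an illustration under stated assumptions rather than the program used in this research: keywords are taken as independent, the probability table is indexed as `cp[keyword][class]`, and the case in which no class reaches the threshold, which the text does not discuss, is simply treated as "continue". The names `posteriors` and `decide` are hypothetical.

```python
def posteriors(keywords, priors, cp):
    """alpha_j = P(C_j/W_1,...,W_n) for independent keywords.

    priors: dict class -> P(C_j); cp: dict keyword -> dict class -> P(k/C_j).
    """
    scores = dict(priors)
    for w in keywords:
        for c in scores:
            scores[c] *= cp[w][c]
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def decide(alphas, threshold):
    """Return ('classify', [class]) or ('continue', retained_classes)."""
    retained = [c for c, a in alphas.items() if a >= threshold]
    if len(retained) == 1:
        return "classify", retained
    # More than one class (or none, an undiscussed case) is above threshold.
    return "continue", retained
```

Since the posterior values sum to one, at least one class always reaches the threshold whenever the threshold does not exceed the reciprocal of the number of classes.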

3.2.3 The Parameters T and R



In the actual implementation of the sequential method, the
probability ratio test is not necessarily computed after every
keyword. This is because in many instances the very first keyword
may be a strong indicator for a particular class and a poor one for
the rest of the classes. This might lead to a precipitous decision
at a very early stage in the sequential process. To avoid this a
parameter T is introduced such that no ratio is computed and no
decision regarding the classification of a document is made until at
least T keywords are read.

Again it might be computationally wasteful, depending on
the quality of the keywords and the nature of the data base, to
compute the $\alpha_j$ values after every keyword subsequent to the first T
keywords. To control this, another parameter R is introduced so that
after the first T keywords, the $\alpha$-tests are conducted only after
groups of R keywords have been read. Thus at the end of the first
stage of the sequential process, T keywords have been read, and
at the end of the second stage T+R keywords have been read and so
on until at the end of the nth stage [T+(n-1)R] keywords have been
read. Depending on the data base under consideration T and R may
be varied to obtain the desired classification accuracy.
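
The staging implied by T and R can be written down directly. The helper names `keywords_read` and `alpha_test_points` are hypothetical and serve only to illustrate the schedule; R is assumed to be at least one.

```python
def keywords_read(stage, T, R):
    """Number of keywords read by the end of a given stage (stage >= 1)."""
    return T + (stage - 1) * R


def alpha_test_points(num_keywords, T, R):
    """Keyword counts at which an alpha-test is run for one document."""
    return list(range(T, num_keywords + 1, R))
```

For a document with ten keywords, T = 3 and R = 2, the tests fall after the third, fifth, seventh and ninth keywords; a document with fewer than T keywords is never tested at all.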

3.3 Calculation of the Conditional Probability Matrix CP

As indicated in section 3.2, the sequential classification
technique assumes that the keyword-class conditional probabilities
are available for all possible keyword-class pairs. This is represented
as a conditional probability matrix CP, each element of which is the
probability of a keyword occurring in a document belonging to a
particular class. We will now consider how to calculate the elements
of such a matrix.

Given a set of documents, a set of classes $C_1, C_2, \ldots, C_t$, and a
set of keywords $K = \{k_1, k_2, \ldots, k_m\}$, the problem is to compute
$P(k_i/C_j)$ for each keyword-class pair. To do this, all the available
documents are divided into two sets called the sample and the test
documents. We assume that each document in the sample set is already
classified into one or more classes manually. Keyword frequency tables
by classes can then be prepared using the sample set. Let $D_j$ represent
the subset of the sample documents associated with class $C_j$ and
$r_{il}$ denote the number of occurrences of keyword $k_i$ in document
$d_l$. Then the frequency of keyword $k_i$ given that a document is in
class $C_j$ is

$$f(k_i/C_j) = \sum_{d_l \in D_j} r_{il} \tag{3.10}$$

These frequencies are calculated for each keyword and stored in a
frequency table. Then each element of the CP matrix is calculated
as follows:

$$P(k_i/C_j) = \frac{f(k_i/C_j)}{\sum_{q=1}^{m} f(k_q/C_j)} \tag{3.11}$$

where m is the number of keywords. For any class $C_j$ let
$K_j \subseteq K$ denote the set of all keywords that have occurred in at
least one sample document belonging to class $C_j$. Then if we sum
both sides of equation (3.11) over all keywords present in $K_j$ we have

$$\sum_{k_i \in K_j} P(k_i/C_j) = \frac{\sum_{k_i \in K_j} f(k_i/C_j)}{\sum_{q=1}^{m} f(k_q/C_j)} \tag{3.12}$$

Now

$$\sum_{q=1}^{m} f(k_q/C_j) = \sum_{k_i \in K_j} f(k_i/C_j)$$

because for any keyword $k_i$ contained in K but not contained in $K_j$,
$f(k_i/C_j)$ is equal to zero. Therefore $\sum_{k_i \in K_j} P(k_i/C_j) = 1$.
Hence the set of values $\{P(k_1/C_j), \ldots, P(k_m/C_j)\}$ is a
probability set.
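
The calculation of the CP matrix from a manually classified sample set can be sketched as follows. The input layout (a list of word lists paired with class lists) and the function name `build_cp` are assumptions made for illustration; the result is indexed as `cp[keyword][class]`.

```python
from collections import Counter, defaultdict


def build_cp(sample_docs, keywords):
    """Estimate P(k_i/C_j) from manually classified sample documents.

    sample_docs: list of (words, classes) pairs; keywords: set of keywords.
    Returns cp[keyword][class].
    """
    freq = defaultdict(Counter)  # freq[class][keyword] = f(k_i/C_j)
    for words, classes in sample_docs:
        counts = Counter(w for w in words if w in keywords)
        for c in classes:  # a document may belong to several classes
            freq[c].update(counts)
    totals = {c: sum(table.values()) for c, table in freq.items()}
    return {
        k: {c: (freq[c][k] / totals[c] if totals[c] else 0.0) for c in totals}
        for k in keywords
    }
```

Keywords that never occur in the sample documents of a class receive probability zero here, so the values for each class sum to one over the keyword set.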

Since during classification these quantities do not change,
they need to be calculated only once and stored. The a priori
probability of a class $C_j$, denoted $P(C_j)$, prior to the initiation of
the algorithm is calculated as

■- ftk /e ■) -^-^-^ 



Assume that a keyword $k_i$ occurs only in sample documents
belonging to a given class $C_j$. Then all values $P(k_i/C_p)$ such that p
is not equal to j will be zero. Now assume that a test document is
being classified by the sequential procedure and at a stage n keyword
$k_i$ occurs in this document. Let the n-1 previous keywords be
denoted by $k_{i_1}, \ldots, k_{i_{n-1}}$; then since

$$P(k_{i_1}, \ldots, k_{i_{n-1}}, k_i/C_p) = P(k_{i_1}, \ldots, k_{i_{n-1}}/C_p) \cdot P(k_i/C_p),$$

the numerator of equation (3.9) becomes zero for all classes $C_p$ where
$p \neq j$. Hence $\alpha_p$ for all $p \neq j$ becomes zero. Thus all classes
$C_p$, $p \neq j$, are dropped from consideration and $C_j$ is chosen as the
correct class since $\alpha_j > \alpha$. Thus we see that the occurrence of one
keyword has led to such a drastic decision, no matter how high the
$\alpha_p$ values had been before keyword $k_i$ occurred. The situation would
be even more critical if $k_i$ had been the first keyword read. Then the
document would have been classified after just one keyword. To avoid
such precipitous decisions, every zero entry in the CP matrix is
replaced by a small value, smaller than all other values in the matrix.
Also, it should be noted that if the sample set chosen were larger,
then it is highly probable that the number of non-zero entries in the
CP matrix would have increased considerably. This small value which
replaces every $P(k_i/C_j)$ which equals zero is called the default
probability. The implication of such a replacement is that equation
(3.9), which calculates the $\alpha_j$ values, is now not truly Bayesian
because the set of probabilities of keywords associated with a given
class no longer sums to unity, as pointed out earlier in this section.
Now every keyword is associated with every class because even those
that did not occur in any sample document belonging to that class have
a default probability assigned to them. Obviously the smaller the
value used for this default probability, the closer is equation (3.9)
to being truly Bayesian. When implementing the sequential method
on a digital computer there is a limit to how small this value can be
made. Scaling techniques have been developed to alleviate underflow
problems caused by using very small values of the default probability.
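
The effect of the default probability, and one way of keeping the posterior calculation away from underflow, can be sketched as follows. The scaling used here (working with logarithms and subtracting the largest score before exponentiating) is a substitute chosen for illustration; it is not claimed to be the scaling technique referred to above. The function names and the default value in the usage note are hypothetical.

```python
import math


def apply_default(cp, default):
    """Replace every zero entry of cp[keyword][class] by the default value."""
    return {
        k: {c: (p if p > 0.0 else default) for c, p in row.items()}
        for k, row in cp.items()
    }


def log_space_posteriors(keywords, priors, cp):
    """Posterior of each class, computed with logarithms to avoid underflow.

    Assumes every entry of cp is positive, i.e. apply_default was used.
    """
    scores = {
        c: math.log(priors[c]) + sum(math.log(cp[w][c]) for w in keywords)
        for c in priors
    }
    peak = max(scores.values())
    # Rescale by the largest score so that at least one term equals 1.
    weights = {c: math.exp(s - peak) for c, s in scores.items()}
    total = sum(weights.values())
    return {c: wgt / total for c, wgt in weights.items()}
```

With a default of, say, 0.0001 a class is no longer eliminated outright by a single keyword it has never been seen with; its posterior merely becomes small, and later keywords can still raise it.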

3.4 Storage Technique for the CP Matrix

The CP matrix is stored as a hash table with the keywords being
the hash keys. Each entry in the table stores the keyword and the
associated probabilities of the classes. This method seems to be
ideally suited for the sequential classification method as
an extensive amount of searching is required. Tests have shown that
processing time is considerably lower for search techniques employing
hashing than sequential or binary searches.

The sizes of the hash tables were designed to provide a 50-80%
loading factor, which yielded a low average number of probes of the
table and resulted in few collisions. Day's algorithm [6] was used
to resolve collisions.

It might seem that a considerable amount of core storage may be
saved by storing the keyword-frequency table in peripheral storage
such as a disk. However, if this is done the search time will increase
sharply and may become prohibitive. For every keyword extracted from
a document, the keyword table has to be accessed at least once.
Therefore if this table is stored on a disk, the I/O time required
may be so high that it offsets the saving obtained in core storage.

3.5 The Stop List

In order to speed up the process of extracting keywords from
a document, a stop list of very commonly occurring words, such as "an",
"the", "when", etc., is maintained. Every word extracted from a
document is first searched in the stop list. If it is not contained
in the stop list, then the keyword table is searched. Like the
keyword list, the stop list is also stored as a hash table. Since
most words present in a document are non-keywords, the probability
of finding an extracted word in the stop list is higher than finding
it in the keyword list. If there were no stop list, then the keyword
table would have to be accessed a number of times before it could be
determined that a given word was not a keyword. In general,
considerably fewer accesses of the stop list would be required to reach
the same decision. This is the main reason why maintaining a stop
list is advantageous.
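
The two-step lookup can be sketched with hashed sets standing in for the stop list and the keyword table. The word pattern, the sample stop words beyond those quoted above, and the function name `extract_keywords` are assumptions made for illustration.

```python
import re

# A small hypothetical stop list; a hashed set gives constant-time lookups.
STOP_LIST = {"a", "an", "the", "when", "of", "in", "is"}


def extract_keywords(text, keyword_table, stop_list=STOP_LIST):
    """Yield the keywords of a text in reading order.

    Each word is looked up in the stop list first; the keyword table is
    consulted only for words that are not stop words.
    """
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in stop_list:
            continue
        if word in keyword_table:
            yield word
```

Because the function is a generator, a sequential classifier can stop pulling keywords as soon as a decision is reached, and the rest of the document is never scanned.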

3.6 Implementation of the Sequential Method

The basic sequential method discussed in this chapter was
developed and implemented by White and coworkers [28]. Detailed
discussions of its implementation and the results obtained by its
use on various data bases will be found in the reference cited. Here
we shall present a few points of interest about the implementation.
A flowchart of the sequential method is presented in Figure 3.2.
As can be seen in the flowchart the implementation varies in one
aspect from the theoretical sequential algorithm. Instead of
retaining all the classes until a decision is made, whenever the
$\alpha_j$ value of a class $C_j$ drops below the threshold $\alpha$, it is
eliminated from consideration. This is done in the interest of reducing
computation time. For each class that is retained an extra
multiplication, division and comparison is required. Besides, classes
whose $\alpha_j$ values are less than the threshold may cause underflow
problems if they are retained. Since these classes have low $\alpha_j$
values it is very likely that keywords occurring in the document in the
subsequent stages of the sequential process will have default or very
low probability values associated with them. Hence, during the
calculation of their $\alpha_j$ values, the numerator of equation (3.9) will
involve multiplication of a large number of small quantities and
may cause underflow. For systems with a very large number of classes
this might prove to be a substantial overhead. Of course if all
classes were kept until a decision is made, a more accurate decision
should result.

Figure 3.2 Flowchart of the Sequential Method
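
A minimal sketch of the implemented procedure as described in this section is given below. It assumes independent keywords, a probability table indexed as `cp[keyword][class]` whose entries are all positive (zeros already replaced by the default probability), and the parameters $\alpha$, T and R of section 3.2.3. The function name `sequential_classify` and the choice to return `None` for an unclassifiable document are hypothetical.

```python
def sequential_classify(keywords, priors, cp, alpha, T, R):
    """Classify one document from its keyword sequence.

    Returns (class or None, number_of_keywords_read).
    """
    active = list(priors)
    score = {c: priors[c] for c in active}  # prior times likelihood so far
    next_test = T
    for count, w in enumerate(keywords, start=1):
        for c in active:
            score[c] *= cp[w][c]
        if count < next_test:
            continue  # no alpha-test until T, then T+R, T+2R, ... keywords
        next_test += R
        # Normalizing over the active classes only is equivalent to
        # recalculating the priors over the classes that remain.
        total = sum(score[c] for c in active)
        alphas = {c: score[c] / total for c in active}
        active = [c for c in active if alphas[c] >= alpha]
        if len(active) == 1:
            return active[0], count
        if not active:
            return None, count
    return None, len(keywords)  # keywords exhausted: unclassifiable
```

Dropped classes are never touched again, which saves the multiplication, division and comparison per keyword mentioned above.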

It should be noted here that as a result of this decision to
drop classes whenever the $\alpha_j$ values fall below the prespecified
threshold, a slight change has to be made in the calculation of the
a priori probability of the classes $P(C_j)$. This was done at the
very beginning of the sequential process by using equation (3.13).
Now at each stage of the sequential process, when classes are dropped
the a priori probabilities $P(C_j)$ are recalculated considering only
the set of classes that remain at that stage.
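
One plausible reading of this recalculation is a renormalization of the original priors over the classes that remain. The sketch below illustrates that reading only; the exact recalculation used in the implementation is not restated here, and the function name `renormalize_priors` is hypothetical.

```python
def renormalize_priors(priors, remaining):
    """Rescale the a priori probabilities so that they sum to one
    over the classes still under consideration."""
    total = sum(priors[c] for c in remaining)
    return {c: priors[c] / total for c in remaining}
```

For example, priors of 0.5, 0.3 and 0.2 become 0.625 and 0.375 once the third class has been dropped.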

The sequential algorithm was applied to the SPIN data base
containing a total of 500 documents. Each of these documents was
classified by the American Institute of Physics (AIP) into one or more
classes of a set of seven categories. Classification results obtained
by the sequential algorithm were compared with those of the AIP to get
an estimate of how well the method has performed. A sample of the
best results obtained is presented in Table 3.1. Several values of the
three parameters $\alpha$, T and R were used.




Table 3.1 Experiments Varying T and R
with $\alpha$ = 0.13

Variation of $\alpha$ did not produce significant change in
classification accuracy. With an increase in T, an increase in
accuracy was obtained. This is to be expected because as T is
increased, more keywords are read before any $\alpha$-tests are performed.
This means that more keywords are examined before any classes are
dropped or before an attempt is made to classify a document. Hence
precipitous decisions are avoided. Variation in R did not produce
significant change in accuracy. However, as T and R are increased,
even though classification accuracy increases, fewer documents are
actually classified by the algorithm. This is because now fewer
documents have enough keywords to satisfy the T and R thresholds.
These results will later be compared with the results obtained by
applying the revised sequential method of Chapter VI to the same
data base.

The next section takes a critical look at the sequential algorithm
and points out some of its limitations. These form the basis of the
philosophy of the revised sequential method that is discussed in
Chapters IV and V.



3.7 Limitations of the Sequential Method

In order to understand some of the limitations of the sequential
method, it is necessary to look at the most critical step of the
algorithm, viz., the step that calculates the a posteriori probability
of the classes, the $\alpha_j$ values. Suppose that at any given stage
of the sequential process, a string of keywords $W_1, W_2, \ldots, W_n$
has been read from a document, where each $W_i$ is a member of the
keyword set K. Then the a posteriori probability for a class $C_j$,
denoted by $\alpha_j$, is calculated as follows:

$$\alpha_j = \frac{P(C_j) \, P(W_1/C_j) \cdots P(W_n/C_j)}{\sum_{q=1}^{t} P(C_q) \, P(W_1/C_q) \cdots P(W_n/C_q)} \tag{3.14}$$

There are several salient points to note about this Bayes law.
First, the keywords are treated as being completely independent, i.e.,
the following is assumed:

$$P(W_1, W_2, \ldots, W_n/C_j) = P(W_1/C_j) \cdot P(W_2/C_j) \cdots P(W_n/C_j) \tag{3.15}$$

whereas in fact a more rigorous and accurate technique would be to use
the following:

$$P(W_1, W_2, \ldots, W_n/C_j) = P(W_1/C_j) \cdot P(W_2/C_j, W_1) \cdots P(W_n/C_j, W_1, \ldots, W_{n-1}) \tag{3.16}$$

A critical look at equation (3.16) shows that in order to consider
keyword dependences, frequencies of co-occurrences of words have to be
computed. Let us consider the calculation of $P(W_2/C_j, W_1)$, for instance.
We note that

$$P(W_2/C_j, W_1) = \frac{P(W_1, W_2/C_j)}{P(W_1/C_j)} \tag{3.17}$$

$P(W_1/C_j)$ is already available from previous calculations. Therefore,
$P(W_1, W_2/C_j)$ has to be calculated. For every sample document that
belongs to $C_j$, counts would have to be made for the number of times
$W_1$ and $W_2$ occur together in the same document. The probability is
then obtained by dividing this count by the total count of the
frequencies of every keyword pair occurring in class $C_j$. This has
to be done for every possible keyword pair. For three-keyword
dependencies the computation becomes even more cumbersome. Thus, it
becomes clear from these two forms of the same expression and the
discussion above that an assumption of keyword independence reduces
the computational complexity of the problem immensely.
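The counting procedure described above can be sketched as follows. This is only an illustration of the bookkeeping involved; the function name `pair_probabilities` and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_probabilities(documents):
    """Estimate P(W_a, W_b / C_j) from the sample documents of one class.

    documents -- list of keyword lists, all belonging to the same class C_j
    """
    counts = Counter()
    for keywords in documents:
        # Count each unordered pair once per document in which both occur.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    total = sum(counts.values())  # total frequency of all keyword pairs
    return {pair: n / total for pair, n in counts.items()}

# Hypothetical sample documents for a single class.
probs = pair_probabilities([["a", "b"], ["a", "b", "c"]])
```

Even in this toy case the table grows with the square of the number of keywords, which is the cost that the independence assumption avoids.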
Besides being cumbersome to compute, equation (3.16) fails to
take into account the entire dependency of a keyword on every other
keyword that occurs in a document. In the string of keywords
$W_1, W_2, \ldots, W_n$, equation (3.16) computes the probability based on
the assumption that the occurrence of $W_i$ depends on the keywords
that have preceded it in the document. Therefore the dependence of
$W_i$ on keywords occurring after it is not captured.

The second major point is that, because of the assumed
independence, every keyword's effect on the $\alpha_j$ value for a particular
class is directly proportional to its frequency of occurrence in that
class. In other words, let us assume that we have three keywords of
moderate probabilities $p_1, p_2, p_3$ from a class $C_i$ and one keyword of
very high probability $p_4$ from a class $C_j$. Let the default
probability being used be denoted by d. Then, for equal a priori
probabilities, at the end of the fourth keyword the ratio of the
$\alpha$ values of the two classes will be given by:

$$\frac{\alpha_i}{\alpha_j} = \frac{p_1\,p_2\,p_3\,d}{d^3\,p_4} = \frac{p_1\,p_2\,p_3}{d^2\,p_4} \tag{3.18}$$

Depending on the values of $p_1, p_2, p_3$ and $p_4$, it is possible that this
ratio is less than unity, i.e., $p_4 d^2 > p_1 p_2 p_3$. If this is the case
then the effect of three keywords is nullified by that of a single
keyword. If the document actually belongs to $C_j$ this is a desirable
situation; if it belongs to $C_i$ then the fourth keyword is a noisy
word and should be recognized as such.
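A small numerical sketch may make this effect concrete. It assumes equal a priori probabilities for the two classes, and every value below is hypothetical, chosen only to show how a single strong keyword can outweigh three moderate ones.

```python
def alpha_ratio(p1, p2, p3, p4, d):
    """Ratio alpha_i / alpha_j after four keywords, equal priors assumed."""
    # Class C_i: three keywords with moderate probabilities; the fourth
    # keyword is unknown to C_i and so receives the default probability d.
    alpha_i = p1 * p2 * p3 * d
    # Class C_j: the first three keywords receive the default probability;
    # only the fourth keyword is strongly indicative of C_j.
    alpha_j = d * d * d * p4
    return alpha_i / alpha_j

# Hypothetical values: three moderate keywords, one very strong keyword.
ratio = alpha_ratio(0.01, 0.01, 0.01, 0.5, 0.005)
```

With these values the ratio is well below unity, so the single keyword of class $C_j$ dominates the three keywords of class $C_i$.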

From these two observations made above about the sequential
method, one of its major drawbacks can be described as follows.
The sequential method is unable to isolate inappropriate keywords
from a set of good keywords. As in the example given above, the
fourth keyword with probability $p_4$ may be a noisy keyword; it should
be recognized as such and not considered for the probability
calculations. It is also obvious that a decision about the
inappropriateness of this keyword has to be taken in the context of the three
keywords which have previously been identified. Thus, even though we do
not want to have to calculate keyword dependencies at each stage, we
do need a measure of "closeness" between keywords. This measure
should be able to isolate noisy keywords from good ones based on the
distance between keywords. The decision regarding which keywords
should be considered as noisy may change as new keywords are read
from the document. A distance measure that does this effectively
will be introduced in the next chapter.

Another disadvantage which is not apparent from the above
discussion is that the sequential method lacks the power to classify
documents into more than one class in a systematic fashion. Very
often, based on its subject content, a document should be classified
into more than one class. Since at each stage of the classification
process the $\alpha_j$ values denote the a posteriori probability of the jth
class at that point, the sequential method can use this information
to assign a document to a second class. Let us say that we have a
three-class problem, i.e., an incoming document may be classified in
classes $C_1, C_2, C_3$. At the point of classification let the a posteriori
probabilities be given by $\alpha_1, \alpha_2, \alpha_3$, where we will assume without loss
of generality that classes $C_1$, $C_2$ and $C_3$ are ordered such that
$\alpha_1 > \alpha_2 > \alpha_3$, with $\alpha_1$ exceeding the preset threshold level. The primary
class is of course $C_1$, but how about a secondary class?

Several very simple-minded techniques may be used to assign a
secondary class at this stage. Since $\alpha_2 > \alpha_3$ we may decide to denote
$C_2$ as a secondary class. However, this technique will obtain a
secondary classification in every case, irrespective of whether or
not it is appropriate to assign such a class. Even though $\alpha_2$ and $\alpha_3$
might each be very small, one of them will always be chosen as a
secondary class.

' ■ ■ ■■ , . • i . ■ ■- _ ■- , . ' ■ . ^ 

One way to handle this problem is to require that $\alpha_2$ be
comparable to $\alpha_1$. This can be done by means of a ratio test:

(i) compute the ratio $\alpha_2/\alpha_1$;
(ii) if the ratio exceeds a predetermined threshold, then
denote $C_2$ as a secondary class.
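The ratio test can be sketched as below. This is a minimal illustration of the simple test only, not of the correlation-based variant; the function name `secondary_class` and the threshold value are assumptions made here.

```python
def secondary_class(alphas, ratio_threshold):
    """Return (primary, secondary) class indices; secondary may be None.

    alphas          -- a posteriori probabilities at the point of classification
    ratio_threshold -- hypothetical predetermined threshold for the ratio test
    """
    # Order the class indices by decreasing a posteriori probability.
    order = sorted(range(len(alphas)), key=lambda j: alphas[j], reverse=True)
    primary, runner_up = order[0], order[1]
    ratio = alphas[runner_up] / alphas[primary]
    # Assign a secondary class only if it is comparable to the primary one.
    secondary = runner_up if ratio > ratio_threshold else None
    return primary, secondary

with_secondary = secondary_class([0.60, 0.35, 0.05], 0.5)
without_secondary = secondary_class([0.90, 0.06, 0.04], 0.5)
```

In the second call the runner-up is far weaker than the primary class, so no secondary class is assigned.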
A variant of the ratio test, where the correlation coefficient between
the primary and the possible secondary class is taken into
account, was actually implemented in the sequential method by White
and coworkers [28]. A secondary class would be obtained only if, in
addition to satisfying the ratio test, the correlation coefficient
between the two classes were higher than a given threshold. The
rationale behind this method can be explained as follows. Let us
assume that classes $C_1$ and $C_2$ are the possible choices for a primary
and secondary class respectively. Then a Venn diagram representation
of the keywords belonging to these classes can be portrayed by
Figure 3.3. Region I represents words which exclusively belong to class
$C_1$, region II represents words which have non-default probability
values for both classes $C_1$ and $C_2$, and region III represents words which
belong exclusively to class $C_2$. Let $n_1$, $n_2$ and $n_3$ be equal to the
number of keywords in regions I, II and III respectively. Then the
correlation coefficient $\rho(C_1, C_2)$ between classes $C_1$ and $C_2$ is given by




(3.19) 



Figure 3.3 Representation of Correlation Between Two Classes (regions I, II and III)

If the value of $\rho(C_1, C_2)$ is high, then it is likely that a given
document may belong to both these classes. If, however, $n_2$ is low
and $n_1$ and $n_3$ are high, then $\rho(C_1, C_2)$ will have a small value. In
such a case, $C_2$ will never be assigned as a secondary class even
though a document may contain keywords from both classes.
This suggests that, instead of using such a test, a
method should be devised which will be able to separate groups of
keywords occurring from regions I and III and analyze them separately,
to obtain primary and secondary classes. The next chapter addresses
the problem of isolating noisy keywords and identifying clusters
of similar keywords. It defines a distance measure, called the
Bayesian distance, and shows how its properties can be utilized to
design a technique for handling these problems.

CHAPTER IV

THE BAYESIAN DISTANCE

The sequential method that has been described in Chapter III
is a fast and fairly accurate method for document classification.
However, in section 3.7, where some of its limitations were noted,
it was pointed out how the sequential method may be vulnerable to
the occurrence of noisy keywords and how it fails to achieve
secondary classification in a systematic fashion. It was shown
that if an appropriate distance measure could be defined, then
noisy keywords could be isolated and clusters of similar keywords
could be identified. These keywords could then be analyzed to
obtain primary and secondary classes for a document. This chapter
defines such a distance measure and outlines some of its properties.
Before this is done, however, some required concepts are defined.

Good Keyword: It was noted in Chapter III that a keyword $k_i$ has an
associated set of probabilities represented as $k_i : [p_{i1}, p_{i2}, \ldots, p_{it}]$.
If $k_i$ occurs in a document then the quantity $p_{ij}$, which denotes
$P(k_i/C_j)$, measures the strength with which this keyword relates the
document to class $C_j$. Suppose a document belongs to class $C_j$. Then
a keyword $k_i$ contained in this document is a good keyword if $p_{ij}$ is
greater than or equal to any of the other probability values associated
with $k_i$.

Noisy Keyword: Referring to the discussion above, keyword $k_i$ is a
noisy keyword if there exist probability values associated with $k_i$
which are greater than $p_{ij}$.

Primary and Secondary Classes: Very often, based on its subject
content, a document may be classified into more than one class, say
$C_i$ and $C_j$. If this is the case then this document should contain
keywords which are indicative of both class $C_i$ and $C_j$. This may
happen in three ways.

(i) The document may contain a group of keywords
$\{k_1, k_2, \ldots, k_n\}$ such that their probability components
for $C_i$ and $C_j$ are predominantly higher than for the
other classes.
(ii) The document may contain groups of keywords
$\{k_1, k_2, \ldots, k_m\}$ and $\{k_{m+1}, \ldots, k_n\}$ such that
the first group is a set of good keywords for class
$C_i$ and the second group is a set of good keywords
for class $C_j$.
(iii) The document may contain keywords which fit into both
categories (i) and (ii). Based on an analysis of
these keywords it may be determined that the document
should be classified into both $C_i$ and $C_j$. If the
keywords are more indicative of $C_i$ than of $C_j$, then
$C_i$ is denoted as the primary class and $C_j$ as the
secondary class.

This chapter deals with the definition of a measure which is
able to isolate groups of words which are similar in nature, in the
sense that words in the same group are all indicative of a particular
class. More specifically, a distance function is defined which is
able to isolate noisy keywords from the good keywords and discard
the noisy ones from consideration while obtaining a primary class.
Noisy keywords so isolated may then be analyzed to determine whether
they indicate the fact that the document may be assigned a secondary
class.

This distance measure, called the Bayesian distance, has been
defined on a t-dimensional vector space which may be used to
represent keywords. The next section discusses such a representation.

4.1 Vector Representation of Keywords

The keyword set representation for the sequential method
consists of a matrix of conditional probabilities. Each of these
probability values represents a numerical measure of the extent to
which a keyword describes a particular class. This matrix, referred
to as the conditional probability matrix CP, was introduced in
section 3.2 of the previous chapter. From the row of this
matrix corresponding to keyword $k_i$, the following is obtained:

$$k_i : [p_{i1}, p_{i2}, \ldots, p_{it}]$$

where $p_{ij} = P(k_i/C_j)$, $j = 1, 2, \ldots, t$.
As the sequential method reads documents and extracts keywords,
it uses only these probabilities associated with a keyword to
calculate the $\alpha_j$ values for the set of classes. Therefore, for the
purposes of classification, a keyword can be represented by a
t-dimensional vector. Two words, even though they are distinct,
will have the same effect as far as the probability calculations
are concerned if the probability vectors associated with them are
the same. By the same token, two words with unequal probability
vectors will have different effects in the a posteriori probability
calculations.

A keyword therefore can be represented as a point in a
t-dimensional vector space where each component of such a vector is
a number between zero and unity. Hence, a document which contains
several such keywords can be represented as a collection of points
in this space. The next section develops this idea.



4.2 Vector Space Representation of a Document

A document, for the purposes of classification, is a group
of keywords $\{k_1, k_2, \ldots, k_n\}$. As noted earlier, each of these
keywords can be represented as a point in a t-dimensional space and
the relationship between them can be studied. For the sake
of simplicity, let us consider a 3-class case and a document that
has a total of seven keywords. These seven points in a 3-dimensional
space may be visualized as shown in Figure 4.1. Suppose that in
this three-class situation, we were to represent every keyword in
the system, i.e., the complete set of keywords K, in this
3-dimensional space. Each keyword $k_i$ will have an associated
probability vector of the form:

$$k_i : [p_{i1}, p_{i2}, p_{i3}]$$

Then three distinct groups of keywords, $G_1$, $G_2$, $G_3$, could be
identified.

(i) $G_1$ represents all keywords for which $p_{i1}$ is greater
than the default probability value.
(ii) $G_2$ represents all keywords for which $p_{i2}$ is greater
than the default probability value.
(iii) $G_3$ represents all keywords for which $p_{i3}$ is greater
than the default probability value.

A keyword, depending on its probability components, may belong to
any one, any two or all three of these groups. In Figure 4.1,
$G_1$, $G_2$ and $G_3$ have been drawn to represent these groups.

Looking at the figure, we can say intuitively that the
document belongs to class $C_1$, with five of its seven keywords
totally or partially representative of that class. In other words,
these keywords are indicative of class $C_1$ because they have a
non-default probability component for that class. They can be said to
form a cluster of similar words. Assuming that the document does
belong to class $C_1$, the remaining two keywords are definitely noisy
words because they have default probability values for class $C_1$ and
are more indicative of classes $C_2$ and $C_3$ respectively. In other
words, the five good keywords are more similar to each other than they
are to the two noisy keywords. If this fact could be recognized during
classification, then the two noisy keywords could be discarded from
consideration while obtaining a primary class.

Figure 4.1 Vector Space Representation of a Document

As another example, instead of having one group of similar
keywords, a document may have a representation of the form
shown in Figure 4.2. Here, we notice that two distinct clusters of
keywords occur. The first group, denoted I, has keywords which
are exclusively indicative of class $C_1$. These may be used to
obtain a primary class. Group II contains words which have
non-default probability components for class $C_2$, and may be used to
obtain a secondary class. Again, intuitively we can see that words
in group I are similar to each other and dissimilar from words in
group II. A measure capable of isolating such groups of keywords
could achieve primary and secondary classification.

4.3 Conventional Distance Measures

The discussion in the previous section has pointed out the
need for a similarity or distance measure which will be able to
isolate noisy keywords and identify groups of similar keywords. In
this section we discuss some of the conventional distance measures
that have been used to compute similarity between keywords. It is
also shown why these measures are inadequate to handle the two
requirements mentioned above, thereby establishing the need for a
new distance measure.



Figure 4.2 Clusters of Keywords



One of the earliest measures used is a correlation
measure [22], defined for binary vectors. Suppose v and w are
t-dimensional binary vectors; then the correlation r between the
two vectors is given by

(4.1)

threshold value. This might prove to be very wasteful of information.
Another disadvantage is that this measure indicates only the similarity,
and not the dissimilarity, between vectors. For instance, two
vector pairs may have the same similarity value even though they are
considerably different.

A more comprehensive measure is the cosine correlation measure
[22]. This has been used in many systems, and specifically in the
SMART retrieval system. Thus, if v and w are two vectors,
the cosine of the angle between them is given by

$$\cos(v, w) = \frac{v \cdot w}{|v|\,|w|}$$

where the numerator is the scalar product of the two vectors and
the denominator is the product of their magnitudes. Our problem,
however, requires that if two vectors lie within the same class
boundary, then they should be "closer" to each other than two vectors
which lie in different classes, even though the angle subtended
between them is the same. The cosine measure fails to achieve this.
Another measure that has been generally used to find the
distance between points in a vector space is the Euclidean distance
measure. Suppose we have two keyword vectors

$$k_1 : [p_{11}, p_{12}, p_{13}]$$
$$k_2 : [p_{21}, p_{22}, p_{23}]$$

then the Euclidean distance $D(k_1, k_2)$ between these vectors is
given by

$$D(k_1, k_2) = \sqrt{\sum_{j=1}^{3} (p_{1j} - p_{2j})^2}$$

Why such a measure is inadequate for our purposes is best illustrated
by an example. Suppose we have two keywords $k_1$ and $k_2$ with associated
vectors

$$k_1 : [0.7, 0.01, 0.01]$$
$$k_2 : [0.2, 0.01, 0.01]$$

Then $D(k_1, k_2) = 0.5$. If instead we had two keywords $k_3$ and $k_4$
with vectors

$$k_3 : [0.01, 0.41, 0.01]$$
$$k_4 : [0.01, 0.01, 0.31]$$

then again $D(k_3, k_4) = 0.5$.



Suppose now that the a posteriori probabilities, the $\alpha_j$
values, for classes $C_1$, $C_2$, $C_3$ are calculated using keywords $k_1$, $k_2$,
$k_3$ and $k_4$ individually, using equation (3.9) of the previous chapter.
We assume that the a priori probabilities of the classes are the
same. These $\alpha_j$ values are shown in Table 4.1. We see from the
values in Table 4.1 that $k_1$ and $k_2$ are both more highly indicative
of class $C_1$ than the other classes. On the other hand, $k_3$
points toward class $C_2$, and $k_4$ is clearly indicative of class
$C_3$. Therefore, our requirements of a distance measure are
that $k_1$ and $k_2$ be identified as being closer to each other than
$k_3$ and $k_4$. The Euclidean distance measure fails to do this. The
following section defines the Bayesian distance, which takes into
consideration not only the magnitudes of each probability component
but also information about how a keyword is related to a given class.
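The contrast drawn in this example can be checked with a short sketch. The four vectors below are assumed illustrative values chosen to agree with the distances quoted in the text and with Table 4.1; the helper name `posteriors_equal_priors` is introduced here only for illustration.

```python
from math import dist

# Assumed keyword probability vectors for a three-class problem.
vectors = {
    "k1": [0.7, 0.01, 0.01],
    "k2": [0.2, 0.01, 0.01],
    "k3": [0.01, 0.41, 0.01],
    "k4": [0.01, 0.01, 0.31],
}

def posteriors_equal_priors(vec):
    """A posteriori class probabilities from one keyword, equal priors assumed."""
    total = sum(vec)
    return [p / total for p in vec]

# Euclidean distances within each pair of keywords.
d12 = dist(vectors["k1"], vectors["k2"])
d34 = dist(vectors["k3"], vectors["k4"])

# Class (numbered from 1) that each keyword points toward most strongly.
top_class = {
    name: max(range(3), key=lambda j: posteriors_equal_priors(vec)[j]) + 1
    for name, vec in vectors.items()
}
```

Both pairs are separated by the same Euclidean distance, yet the first pair agrees on a class while the second pair does not, which is the deficiency the text describes.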

4.4 Definition of the Bayesian Distance

Suppose a document is to be classified into one class of a
set of classes $C_1, C_2, \ldots, C_t$. At a given stage of the sequential
process, suppose we have read i keywords $k_1, k_2, \ldots, k_i$. Then the
a posteriori probability of each of these classes is denoted by
$P(C_j/k_1, k_2, \ldots, k_i)$, where j varies from 1 to t. For simplicity
this will be written as $P(C_j/y)$, where y denotes the occurrence of
keywords $k_1, \ldots, k_i$. Then the a posteriori probabilities of all the
classes after observation y can be represented by the following vector:

$$[P(C_1/y), P(C_2/y), \ldots, P(C_t/y)]$$



Table 4.1 A Posteriori Probabilities of Classes

KEYWORD    $\alpha_1$     $\alpha_2$     $\alpha_3$
$k_1$      0.97      0.015     0.015
$k_2$      0.91      0.045     0.045
$k_3$      0.025     0.95      0.025
$k_4$      0.03      0.03      0.94

It can be noted here that $P(C_j/y)$ is simply the $\alpha_j$ computed by
equation (3.9) of the previous chapter after keywords $k_1$ to $k_i$
have been read.



Definition: The Bayesian distance on the probability space of a
set of classes after an observation y, i.e., after i keywords have
been processed, is $D_B(C/y) = [\mathrm{Mag}(i), \mathrm{Dir}(i)]$. Mag(i) and Dir(i)
represent the magnitude and direction of $D_B$ respectively, and are
defined as

$$\mathrm{Mag}(i) = \sum_{j=1}^{t} [P(C_j/y)]^2$$

$$\mathrm{Dir}(i) = r \ \text{such that}\ P(C_r/y) \ge P(C_j/y), \quad j = 1, 2, \ldots, t.$$

Dir(i) is the index of the class having the highest a posteriori
probability after i keywords have been read.

At this point, before noting the use of this distance measure
for the purpose of classification, it is worthwhile to explore its
relationship with some concepts of interest in classification theory.
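The definition of the Bayesian distance given above can be sketched in a few lines. The function name `bayesian_distance` is an assumption made here, and the input is any vector of a posteriori class probabilities that sums to unity.

```python
def bayesian_distance(posteriors):
    """Return (Mag, Dir) for a list of a posteriori class probabilities."""
    # Magnitude: sum of the squares of the a posteriori probabilities.
    mag = sum(p * p for p in posteriors)
    # Direction: index of the most probable class, numbered from 1
    # to match the class numbering used in the text.
    direction = max(range(len(posteriors)), key=lambda j: posteriors[j]) + 1
    return mag, direction

# A sharply peaked distribution and a uniform one, for comparison.
mag_peaked, dir_peaked = bayesian_distance([0.91, 0.045, 0.045])
mag_uniform, dir_uniform = bayesian_distance([1 / 3, 1 / 3, 1 / 3])
```

The magnitude is largest (unity) when one class is certain and smallest (1/t) when all t classes are equally likely, so it behaves as a measure of confidence in the direction.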



4.5 Classification Error



Since we are using the Bayes conditional relation to obtain the
a posteriori probabilities of the classes at each stage, it is easy
to obtain an expression for the classification error. At the ith
step, for a set of class probabilities
$P(C_1/y), P(C_2/y), \ldots, P(C_t/y)$, if $C_j$ is the class chosen,
then the error of classification is given by

$$P_e = 1 - P(C_j/y)$$

The error will obviously be a minimum if class $C_j$ is such
that its a posteriori probability is higher than all the others.

Theorem 4.1: If $P_e$ denotes the classification error after
observation y, i.e., after i keywords have been read, then

$$P_e \le 1 - \mathrm{Mag}(i)$$

Proof: Let us define an index set $I = \{1, 2, \ldots, t\}$. Then

$$\mathrm{Mag}(i) = \sum_{j \in I} P(C_j/y)^2 \le \Big[\max_{j \in I}\{P(C_j/y)\}\Big]\Big[\sum_{j \in I} P(C_j/y)\Big]$$

Since the a posteriori probabilities sum to unity,

$$\mathrm{Mag}(i) \le \max_{j \in I}\{P(C_j/y)\} = 1 - P_e$$

This shows that the magnitude of the Bayesian distance yields
an upper bound, 1 - Mag(i), on the classification error at any stage
of the sequential process.

4.6 Information Theoretic Measure of the Value of a Keyword



Besides the Bayesian distance which is being proposed, the
theory of classification provides other measures related to the
classification error. Some of these may be adapted to the present
problem. One such measure is the information theoretic measure of
the value of a feature measurement. This can be adapted to the
problem of document classification as follows.

Before any keywords are read, the a priori probabilities of
the classes are

$$P(C_1), P(C_2), \ldots, P(C_t)$$

The entropy of this probability distribution is given by

$$H = \sum_{j=1}^{t} P(C_j) \log \frac{1}{P(C_j)}$$

After a keyword has been read and the a posteriori probabilities
are calculated using Bayes relationship, the new entropy H' may be
calculated as above. Therefore, the reduction in uncertainty in
the choice of a class is a measure of the amount of information
I given by this keyword. It is given by

$$I = H - H'$$
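The information measure described here can be sketched as follows. Base-2 logarithms are an assumption of this sketch, as are the function names and the example values.

```python
from math import log2

def entropy(probs):
    """Entropy of a probability distribution, in bits (base 2 assumed)."""
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

def information(priors, keyword_vector):
    """Reduction in entropy, H - H', after one keyword has been read.

    priors         -- P(C_j) before the keyword is read
    keyword_vector -- P(k/C_j) for the keyword, one entry per class
    """
    joint = [prior * p for prior, p in zip(priors, keyword_vector)]
    total = sum(joint)
    posteriors = [value / total for value in joint]  # Bayes relationship
    return entropy(priors) - entropy(posteriors)

uniform = [1 / 3, 1 / 3, 1 / 3]
info_strong = information(uniform, [0.2, 0.01, 0.01])  # favours one class
info_flat = information(uniform, [0.1, 0.1, 0.1])      # favours no class
```

A keyword whose probability is the same for every class leaves the distribution unchanged and therefore conveys no information in this sense.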



Therefore, for a given H, the lower the value of H', the greater is
the information given by the keyword. It can be shown that, for a
2-class problem, $P_e \le \tfrac{1}{2}H'$. Thus, like the Bayesian distance, the
entropy measure also forms an upper bound on the error of
classification. However, it can be proven that the magnitude of the
Bayesian distance forms a tighter upper bound than the entropy
measure.

Consider two classes $C_1$ and $C_2$. Let $P_1$ represent the
probability of class $C_1$ at any stage; then $1 - P_1$ is the probability
of class $C_2$. We can then compute the values of the classification
errors $P_e$, the Bayesian distance bounds 1 - Mag(i), and
the entropy bounds, i.e., $\tfrac{1}{2}H$, for various values of $P_1$. Table 4.2
presents these quantities at five different values of $P_1$ and Figure
4.3 represents them graphically. These upper bounds have been
demonstrated for a two-class problem. It can be proven that as the
number of classes increases, the quality of the Bayesian distance
as an upper bound on the classification error does not
degrade very much. This is stated as a theorem and proven in
Appendix A.



Table 4.2 Upper Bounds on Classification Error

$P_1$      $P_e$      1 - Mag(i)    $\tfrac{1}{2}H$
0          0          0             0
0.25       0.250      0.375         0.406
0.5        0.500      0.500         0.500
0.75       0.250      0.375         0.406
1          0          0             0
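The three quantities compared in Table 4.2 can be recomputed with a short sketch, assuming base-2 logarithms for the entropy and that the more probable class is always the one chosen. The function name `two_class_bounds` is hypothetical.

```python
from math import log2

def two_class_bounds(p1):
    """Return (error, Bayesian distance bound, entropy bound) for P_1 = p1."""
    p2 = 1.0 - p1
    error = min(p1, p2)                      # error when the likelier class is chosen
    bayes_bound = 1.0 - (p1 ** 2 + p2 ** 2)  # 1 - Mag(i)
    entropy = sum(p * log2(1.0 / p) for p in (p1, p2) if p > 0)
    return error, bayes_bound, entropy / 2.0

# One row per value of P_1, in the order (P_1, error, 1 - Mag, H / 2).
rows = [(p1, *two_class_bounds(p1)) for p1 in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

In every row the error does not exceed the Bayesian distance bound, which in turn does not exceed the entropy bound, in line with the claim that the Bayesian distance gives the tighter of the two bounds.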




4.7 Use of Bayesian Distance in Classification

In the previous sections a definition of Bayesian distance
has been given and some of its properties with respect to
classification theory have been studied. In this section it will be shown
why the Bayesian distance is an appropriate tool to handle the
problems that have been outlined.

Two of the requirements of a distance measure used for the
purposes of classification are as follows:

(i) the distance measure should be able to isolate noisy
keywords, and
(ii) it should be able to identify clusters of similar
words or words that are predominantly indicative of
one class.

Suppose a keyword $k_1$ has been read from a document. Then the a
posteriori probabilities of each of the classes, $P(C_j/k_1)$, are given by
the $\alpha_j$ values that are calculated at each stage of the sequential
process. Let the set of $\alpha$ values be $\{\alpha_1, \alpha_2, \ldots, \alpha_t\}$.
Let I represent the index set $\{1, 2, \ldots, t\}$. If $\alpha_r$ is the highest
$\alpha$ value at this point, then the magnitude and direction of the
Bayesian distance $D_B$ are given by

$$\mathrm{Mag}(1) = \sum_{j \in I} (\alpha_j)^2, \qquad \mathrm{Dir}(1) = r$$



We note that Mag(1) is the sum of the squares of a set of numbers
which sum to unity. Let $x = \{x_1, x_2, \ldots, x_t\}$ denote such a set.
The sum of the squares is then given by

$$S = \sum_{j=1}^{t} x_j^2$$

Let $x_r$ be such that $x_r \ge x_j$ for all j, where $x_r$ is now increased by
a quantity $\delta$ and the others are decreased in any desired way such
that we have a new set of numbers $y = \{y_1, y_2, \ldots, y_t\}$ with

$$y_r = x_r + \delta \quad \text{and} \quad \sum_{j=1}^{t} y_j = 1$$

Then the new sum of squares is given by

$$S' = \sum_{j=1}^{t} y_j^2$$

Then S' is greater than S. A proof of this fact will be given in
Appendix B.
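A brief sketch of why the sum of squares increases may be helpful. It is only an outline, it assumes that every component other than $x_r$ is decreased by some amount $\epsilon_j \ge 0$ (the symbols $\epsilon_j$ are introduced here for illustration), and it does not replace the proof in Appendix B.

```latex
% Assumptions: y_r = x_r + \delta, and y_j = x_j - \epsilon_j with
% \epsilon_j \ge 0 for j \ne r, where \sum_{j \ne r} \epsilon_j = \delta > 0.
\begin{align*}
S' - S &= (x_r + \delta)^2 - x_r^2
          + \sum_{j \ne r} \left[ (x_j - \epsilon_j)^2 - x_j^2 \right] \\
       &= 2 x_r \delta + \delta^2
          - 2 \sum_{j \ne r} x_j \epsilon_j + \sum_{j \ne r} \epsilon_j^2 \\
       &\ge 2 x_r \delta + \delta^2
          - 2 x_r \sum_{j \ne r} \epsilon_j + \sum_{j \ne r} \epsilon_j^2
          \qquad (\text{since } x_j \le x_r) \\
       &= \delta^2 + \sum_{j \ne r} \epsilon_j^2 \; > \; 0 .
\end{align*}
```

For example, with $x = \{0.5, 0.3, 0.2\}$ and $y = \{0.6, 0.25, 0.15\}$, the sum of squares rises from 0.38 to 0.445.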



Therefore if $\alpha_r$ is such that

$$\alpha_r \ge \alpha_j, \quad j \in I$$

then an increase in $\alpha_r$ will increase the value of Mag(1) without
changing Dir(1). It should be noted here that this is a sufficient
condition and not a necessary one.

Since $\alpha_r$ is the highest $\alpha_j$ value, it implies that the rth probability
component of the vector associated with keyword $k_1$ is higher than
the other components of the vector. Therefore if the document
does indeed belong to class $C_r$, then $k_1$ is a good keyword.

Now suppose keyword $k_2$ is read, which has an associated
probability vector as follows:

$$k_2 : [p_{21}, p_{22}, \ldots, p_{2t}]$$

If keyword $k_2$ is also more highly indicative of class $C_r$ than the
other classes, then the rth component will be higher than the other
components. The new set of $\alpha$ values is given by

$$\{\alpha_1', \alpha_2', \ldots, \alpha_t'\}$$

The new Bayesian distance is

$$\mathrm{Mag}(2) = \sum_{j \in I} (\alpha_j')^2, \qquad \mathrm{Dir}(2) = r$$

The value of Mag(2) will be greater than Mag(1) if $\alpha_r'$
exceeds $\alpha_r$. But an increased $\alpha_r$ means that with the occurrence of
keyword $k_2$ the confidence in class $C_r$ being the correct class has
increased. Therefore $k_2$ can be considered to be a good keyword. If
instead Mag(2) decreases or the direction changes, $k_2$ can be
identified as a noisy word in relation to $k_1$ and hence isolated.
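The labeling rule described above can be sketched as a loop over the keywords of a document. This is a simplified illustration under stated assumptions, not the classification algorithm itself: noisy keywords are only labeled and are still included in the running a posteriori probabilities, the first keyword is given the neutral label "first", and a keyword that leaves the magnitude unchanged is treated as noisy. The function name and the example vectors are hypothetical.

```python
def label_keywords(priors, keyword_vectors):
    """Label each keyword from the variation of Mag and Dir as it is read."""
    alphas = list(priors)
    labels = []
    prev_mag = prev_dir = None
    for vec in keyword_vectors:
        # Bayes update of the a posteriori probabilities with this keyword.
        joint = [a * p for a, p in zip(alphas, vec)]
        total = sum(joint)
        alphas = [value / total for value in joint]
        # Magnitude and direction of the Bayesian distance at this stage.
        mag = sum(a * a for a in alphas)
        direction = max(range(len(alphas)), key=lambda j: alphas[j])
        if prev_mag is None:
            labels.append("first")   # nothing to compare against yet
        elif direction == prev_dir and mag > prev_mag:
            labels.append("good")    # confidence in the same class increased
        else:
            labels.append("noisy")   # magnitude fell or direction changed
        prev_mag, prev_dir = mag, direction
    return labels

labels = label_keywords(
    [1 / 3, 1 / 3, 1 / 3],
    [[0.2, 0.01, 0.01], [0.7, 0.01, 0.01], [0.01, 0.01, 0.31]],
)
```

Here the second keyword reinforces the first class and is labeled good, while the third keyword pulls toward another class, lowers the magnitude, and is labeled noisy.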



Thus at each stage of the sequential process, depending on the
nature of variation of the Bayesian distance magnitude and direction,
a keyword can be labeled as either good or noisy. The good keywords
can then be analyzed to obtain a primary class, and the noisy
keywords, which are noisy with respect to this primary class, can
be analyzed to yield a possible secondary class.

The next chapter provides an experimental verification of
the claims made in this section. Also, the fact that the Bayesian
distance magnitude increases with the occurrence of a good keyword
will be proved rigorously for a three-class problem. Finally, it
will be shown how the good and noisy keywords can be effectively
separated and analyzed to yield primary and secondary classification.



CHAPTER V

APPLICATION OF THE BAYESIAN DISTANCE

In the last chapter the concept of a distance measure on the
keyword space was introduced. A specific distance measure, called
the Bayesian distance, was defined and some of its properties with
respect to classification theory were investigated. In this chapter
the use of the Bayesian distance for the purpose of document
classification will be studied. It will be shown that by noting
the variation of the magnitudes and directions of the Bayesian
distance with each keyword read, patterns can be detected which
correspond to noisy words and good words. Also, a distinction can
be made between ideal and non-ideal documents. Results of the
experimental verification of several hypotheses will be presented
and in some cases a mathematical justification will be given.

5.1 Variation of Bayesian Distance with Keywords

In order to test the validity of the concept of Bayesian
distance and how it will perform in the case of an actual data
base, a set of experiments was conducted on the SPIN data base.
The purpose of the experiments was to note the variation of the
magnitude and direction of the Bayesian distance with each keyword
read from a given document.

Although the dharacteAitlGie. ar^- deta^ descrlTblng the SPIN, 
data "baee are descplhed^iln^ [sSlV'the tasie root claisee ojf fePIN ape ' . 
reproduced here^for'ease in Wd,erstanaing the results obtained. ^ 

■ ; ^ ' . ■ ^ ' ^ ^ - . ^ • ^ z^' . ■ . . . " " ^ ' 

Baob 'dODUment in SBIN ma^y be clasSiflei^^lnto. one^or more df^he seven ^ 
ciaises listed in "Table 5.1. A nujneriQal^ cq^eV whleh will he \ ■ ^ 

referenced later in thia chajftrjC is indicated in parentheses, 

Table 5.1  SPIN Root Classes

CODE       CLASSIFICATION

B (0)      General
  (1)      High energy and nuclear physics
M (2)      Atoms, molecules, and chemical physics
  (3)      Fluids and plasmas
  (4)      Solid state physics
T (5)      Acoustics, optics, and cross-disciplinary physics
  (6)      Astronomy and astrophysics



Each document was read entirely and a list of all the keywords
in it was formed. Let this list be represented by

\[ W_1, W_2, \ldots, W_n \]

where $W_i$ is the i-th keyword occurring in the document. Each keyword
is a member of the keyword set K and has an associated probability
vector. Let $\alpha_1, \alpha_2, \ldots, \alpha_t$ represent the a posteriori
probabilities of the classes after the i-th keyword. In Chapter III,
methods for their calculation were discussed. The Bayesian distance
($D_B$) magnitude and direction are calculated using these values.
Let Mag(i) denote the magnitude and Dir(i) denote the direction of
$D_B$ after the i-th keyword is processed. Then

\[ \mathrm{Mag}(i) = \sum_{j=1}^{t} \alpha_j^2 \tag{5.1} \]

and Dir(i) is the index of the highest $\alpha_j$ value. Mag(i) and Dir(i)
are calculated for each value of i ranging from 1 to n. Before the
results of these experiments are reported, several terms need to be
defined. The definitions of good keyword and noisy keyword given
in Chapter IV are repeated here for continuity.

Good Keyword: During the process of classification a keyword is
considered to be a good keyword if its highest probability component
belongs to the class to which the document belongs. In this case
the keyword relates the document to the indicated class more
than to any other.

Noisy Keyword: During the process of classification a keyword is
considered to be a noisy keyword if its highest probability component
points to a class other than a class to which the document belongs.

Ideal Document: A document is considered to be ideal if every
keyword contained in it is good.

Non-Ideal Document: A document is considered to be non-ideal if
there is at least one keyword in it which is noisy.
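
The four definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the sample probability vectors and the use of 1-based class indices are assumptions made here and do not come from the original experiments. Because a document may belong to more than one class, its known classes are passed as a set.

```python
def top_class(vector):
    """Return the 1-based index of the largest probability component."""
    return max(range(len(vector)), key=lambda j: vector[j]) + 1


def is_good_keyword(vector, doc_classes):
    """A keyword is good if its strongest class is a class of the document."""
    return top_class(vector) in doc_classes


def is_ideal_document(vectors, doc_classes):
    """A document is ideal if every keyword in it is good."""
    return all(is_good_keyword(v, doc_classes) for v in vectors)


# Hypothetical three-class document known to belong to class 1.
doc = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]
labels = ["good" if is_good_keyword(v, {1}) else "noisy" for v in doc]
print(labels)
```

Note that these functions need the true class of the document, which is exactly the limitation the following paragraph points out.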

It should be noted here that these definitions presuppose that
the class to which a document belongs is known. During classification,
however, the class membership of a document is not known until the
end of the sequential process. A keyword that is extracted from a
document at each step of this process is tagged as either good or
noisy. This decision depends only on the keywords that have preceded
this particular keyword, and may change as more of the document is
examined. This point is clarified later in the chapter.

Based on these concepts and on the results of the Bayesian
distance values, the complete set of documents was divided into a
set of ideal documents and a set of non-ideal ones. Sample documents
from each set were obtained and examined separately to study the
nature of the variation of Bayesian distance with keywords.
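
As a rough illustration of the quantities Mag(i) and Dir(i) used in this section, the sketch below computes them after each keyword of a small document. It assumes equal a priori class probabilities, so that the a posteriori values are the normalised products of the keyword probabilities; the function names and the sample vectors are hypothetical.

```python
def posteriors(vectors):
    """A posteriori class probabilities from keyword vectors p(k|C_j).

    Assumes equal a priori class probabilities (an assumption of this sketch).
    """
    products = [1.0] * len(vectors[0])
    for vector in vectors:
        products = [a * b for a, b in zip(products, vector)]
    total = sum(products)
    return [x / total for x in products]


def bayesian_distance(vectors):
    """Return (Mag, Dir): sum of squared posteriors and 1-based top class."""
    alpha = posteriors(vectors)
    magnitude = sum(a * a for a in alpha)
    direction = max(range(len(alpha)), key=lambda j: alpha[j]) + 1
    return magnitude, direction


def trace(vectors):
    """Mag(i) and Dir(i) for i = 1..n, using the first i keywords."""
    return [bayesian_distance(vectors[:i]) for i in range(1, len(vectors) + 1)]


# Hypothetical document whose two keywords both favour class 1.
doc = [[0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
steps = trace(doc)
for i, (mag, direction) in enumerate(steps, start=1):
    print(i, round(mag, 3), direction)
```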

5.2 Analysis of Ideal Documents

Twenty documents were selected from the set of ideal documents
for detailed study. In all cases the $D_B$ magnitude increases with
the number of keywords processed until it reaches or closely approaches
a value of unity, the direction being constant over the entire
sequence. This phenomenon is illustrated in Figure 5.1. This fact
can be utilized to recognize documents that are ideal in nature
and classify them into the class designated by the direction of $D_B$.
The fact that such a situation can also occur for non-ideal documents
will be discussed later.

The nature of the variation of the magnitude of $D_B$ has been
utilised to design a classifier which detects such a pattern in
the Bayesian distances obtained by processing the keywords in a test
document. Figure 5.1 illustrates how the magnitude of $D_B$
approaches the value of unity over a sequence of keywords. The
magnitude of $D_B$ at any stage of the sequential process depends on
the set of values $(\alpha_1, \alpha_2, \ldots, \alpha_t)$ computed at that stage. The
only way that the magnitude can assume a value of unity is when one
of the $\alpha_j$ values is unity and the rest are all zero. Suppose n
keywords $W_1, W_2, \ldots, W_n$ have been read. Without loss of generality,
let us assume that each keyword $W_i$ has an associated probability
vector of the form $[p_i, d, d, \ldots, d]$, where d is the default
probability value. Hence, each of these keywords is strongly
indicative of class $C_1$. If the $\alpha_j$ values are computed using these
n keywords then the value of $\alpha_1$ is given by



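\[ \alpha_1 = \frac{\prod_{i=1}^{n} p_i}{\prod_{i=1}^{n} p_i + (t-1)\,d^{\,n}} \tag{5.2} \]

Each of the other $\alpha_j$ values is given by

\[ \alpha_j = \frac{d^{\,n}}{\prod_{i=1}^{n} p_i + (t-1)\,d^{\,n}}, \qquad j = 2, \ldots, t. \]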



Because $p_i > d$, as more and more keywords of the same form are
read, $\alpha_1$ approaches unity and the other $\alpha_j$ approach zero.
Theoretically, however, $\alpha_1$ never reaches a value of unity. The
magnitude of $D_B$ approaches a value of unity as $\alpha_1$ approaches
unity. Therefore Mag(i) can be made as close to unity as desired by
processing more and more good keywords. In practice this is not
feasible because documents have a limited number of keywords.
Besides, if a large number of keywords is read in order to make the
Bayesian distance magnitude very close to unity, the entire purpose
of a sequential technique has been lost.
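
The saturation behaviour described above can be illustrated numerically. The sketch below is a simplification: it assumes that every keyword has the same leading component p (so the product of the $p_i$ becomes $p^n$) and equal a priori class probabilities; the values p = 0.6, d = 0.2 and t = 7 are hypothetical and were chosen only to make the trend visible.

```python
def alpha_1(p, d, n, t):
    """Posterior of class 1 after n keywords of the form [p, d, ..., d]."""
    return p ** n / (p ** n + (t - 1) * d ** n)


def magnitude(p, d, n, t):
    """Bayesian distance magnitude: sum of squared posteriors."""
    a1 = alpha_1(p, d, n, t)
    other = (1.0 - a1) / (t - 1)  # the remaining t-1 classes share the rest
    return a1 ** 2 + (t - 1) * other ** 2


# Hypothetical values: seven classes, p = 0.6, default probability d = 0.2.
mags = [magnitude(0.6, 0.2, n, 7) for n in range(1, 7)]
for n, m in enumerate(mags, start=1):
    print(n, round(m, 4))
```

The printed magnitudes rise with every additional keyword but stay below one, which is why a finite threshold is needed in practice.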

A practical implementation therefore requires the definition
of two parameters. One is the number of keywords that need to be
examined before a decision can be made. This parameter will be
denoted by Q. The other is a parameter called the saturation value
and will be denoted by S. In order that a document may be classified,
the magnitude of $D_B$ after Q good keywords are processed should be
greater than or equal to S. It was pointed out in Chapter IV that
the $D_B$ magnitude yields an upper bound on the error of
classification. S represents a confidence level. The higher the
value of S, the greater is the confidence that the document belongs
to the class given by the index of the largest $\alpha_j$ value.

These two parameters are determined experimentally for a
given data base by taking samples of ideal documents and analyzing
their Bayesian distance patterns. Such experiments have been
conducted for the SPIN data base and the results are presented in
Table 5.2. Before these results are analyzed, however, another
important statistic should be mentioned: the initial value of the
Bayesian distance. Let us denote this by $B_0$. The initial value
$B_0$ reflects the effect of the first keyword, and for an ideal
document all of whose keywords have predominantly high probability
for a single class, the higher the $B_0$ value the faster will the
Bayesian distance reach the saturation threshold. In a practical
implementation the parameter Q can be changed based on the value of
$B_0$. If $B_0$ is low, Q can be increased because more keywords will
be required to reach the saturation threshold. This is clarified in
the discussion which follows.



[Table 5.2  Analysis of Ideal Documents (saturation value = 0.99);
table body not legible]

The documents in Table 5.2 have been arranged in order
according to the value of $B_0$. The third column in this table gives
the number of keywords that are read before the magnitude of $D_B$
reaches or equals the value of the parameter S. For the results
shown in Table 5.2 a saturation value of 0.99 was assigned. As can
be seen from the table, a lower $B_0$ value requires, in general,
that a greater number of keywords be read before saturation is
obtained. In other words, the parameter Q should vary according
to the value of $B_0$. In order to study the variation of this
parameter, the range of $B_0$ for the ideal documents was divided
into four intervals. Experiments were conducted to determine the
average number of keywords required to reach saturation in each of
the intervals. The results presented in Table 5.3 show that when
$B_0$ is lower, a higher average number of keywords are read before
the saturation value is reached. During classification, therefore,
$B_0$ can be used to determine the value of Q.
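
The relationship between the initial value and the number of keywords needed for saturation can be sketched as follows. This is a hedged illustration only: the two documents are invented, equal a priori class probabilities are assumed, and the function name keywords_to_saturation is hypothetical.

```python
def magnitude(vectors):
    """Bayesian distance magnitude after the given keyword vectors."""
    products = [1.0] * len(vectors[0])
    for vector in vectors:
        products = [a * b for a, b in zip(products, vector)]
    total = sum(products)
    return sum((x / total) ** 2 for x in products)


def keywords_to_saturation(vectors, saturation=0.99):
    """Return (initial value, keywords read until Mag >= saturation).

    The count is None if the document ends before saturation is reached.
    """
    initial = magnitude(vectors[:1])
    for n in range(1, len(vectors) + 1):
        if magnitude(vectors[:n]) >= saturation:
            return initial, n
    return initial, None


# Two hypothetical ideal documents for class 1: one with a strong first
# keyword and one with a weak first keyword.
strong_start = [[0.8, 0.1, 0.1]] * 5
weak_start = [[0.5, 0.3, 0.2]] + [[0.8, 0.1, 0.1]] * 5

b0_strong, n_strong = keywords_to_saturation(strong_start)
b0_weak, n_weak = keywords_to_saturation(weak_start)
print(round(b0_strong, 2), n_strong)
print(round(b0_weak, 2), n_weak)
```

In this toy case the document with the lower initial value needs one more keyword, which is the qualitative trend the text reports for the SPIN experiments.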



[Table 5.3  Experiments (title partly illegible). Saturation
value = 0.99. Columns: intervals for $B_0$; number of documents;
keywords required for saturation. Table body not legible]

5.3 Analysis of Non-Ideal Documents

As in the case of ideal documents, 20 non-ideal documents were
selected at random and analyzed. These documents are such that each
one of them contains at least one noisy keyword. For clarity of
presentation, only a sample set of the results obtained is shown in
Table 5.4. In the columns of Table 5.4 the first value represents
the magnitude of $D_B$ and the second value denotes its direction. The
direction corresponds to a numerical index as assigned to each of the
classes, which was identified in Table 5.1. The noisy keywords
are underlined for ease in identification. As can be seen from
the data, the occurrence of a noisy keyword is marked by a change in
magnitude or direction or both in the Bayesian distance values. The
general nature of this variation can be studied by identifying two
cases, depending upon whether only the magnitude changes or both
magnitude and direction change.

[Table 5.4  Analysis of Noisy Documents; table body not legible]



The first type is depicted in Figure 5.2. In this case the
occurrence of a noisy keyword is marked by a decrease in magnitude,
the direction remaining the same. After the noisy keyword, the
magnitude again increases, starting from the following keyword, and
reaches saturation as in the case of an ideal document. This
situation is depicted by documents 1 and 4 in Table 5.4. The second
keyword in document 1 is a noisy keyword and hence the magnitude
drops from 0.880 to 0.803. When the third and the fourth keywords
are processed, the magnitude increases again from 0.803, so these
keywords can be identified as good keywords. For document 4 the same
phenomenon is observed, except that the second and fifth keywords
are both noisy.

Another effect caused by the occurrence of a noisy keyword is
the change in the direction of the Bayesian distance. Here the
noisy keywords are so strongly biased towards a wrong class that a
change in direction occurs irrespective of the nature of the change
in magnitude. This situation is depicted by documents 2, 3 and 5 in
Table 5.4. In document 3, the third keyword is very strongly
indicative of one class. As a result, even though the first keyword
was strongly indicative of a different class, the direction changes
after the third keyword is processed. It can therefore be isolated as
a noisy keyword in relation to the first keyword. In document 5, after
the second keyword is processed, the direction changes from one class
to another. Hence it is considered to be noisy in relation to the
first keyword. The difference between this case and the situation
illustrated by document 3 is that here the first keyword was not
strongly indicative of its class. Hence a change in direction would
have occurred even if the second keyword were not strongly
indicative of the new class.

A classifier can be designed which will detect such variations
in magnitude and direction, and extract the keywords responsible for
causing them. These keywords may then be identified as noisy in
relation to the class currently under consideration and discarded.
An efficient procedure for implementing such a technique will be
discussed in the next chapter. The next section will attempt to
provide a partly intuitive and partly mathematical explanation of
the phenomena observed experimentally in this chapter.
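
The two kinds of variation described in this section, a drop in magnitude and a change in direction, can be detected mechanically. The following sketch flags both events for each keyword after the first. It assumes equal a priori class probabilities, and the four keyword vectors are invented for illustration; they are not taken from Table 5.4.

```python
def mag_dir(vectors):
    """Return (magnitude, 1-based direction) of the Bayesian distance."""
    products = [1.0] * len(vectors[0])
    for vector in vectors:
        products = [a * b for a, b in zip(products, vector)]
    total = sum(products)
    alpha = [x / total for x in products]
    direction = max(range(len(alpha)), key=lambda j: alpha[j]) + 1
    return sum(a * a for a in alpha), direction


def flag_changes(vectors):
    """For keywords 2..n return (magnitude_dropped, direction_changed)."""
    flags = []
    previous = mag_dir(vectors[:1])
    for i in range(2, len(vectors) + 1):
        current = mag_dir(vectors[:i])
        flags.append((current[0] < previous[0], current[1] != previous[1]))
        previous = current
    return flags


# Hypothetical document: keyword 2 is mildly noisy, keyword 4 strongly noisy.
doc = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.7, 0.2, 0.1],
    [0.05, 0.9, 0.05],
]
flags = flag_changes(doc)
print(flags)
```

Here the second keyword lowers the magnitude without changing the direction, while the fourth keyword changes both.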

5.4 Keyword Vectors and Their Relationship to Bayesian Distance

As discussed in an earlier chapter, the information provided by a
keyword can be represented by a t-dimensional vector where t is the
number of different classes in the classification scheme. Therefore
it is of interest to study the relationship between various keyword
vector forms and the effect they have on the Bayesian distance. For
the sake of simplicity a three-class problem will be considered.



Suppose we have read two keywords whose vectors are shown in
Figure 5.3:

\[ k_1 = [p_1, p_2, p_3], \qquad k_2 = [q_1, q_2, q_3]. \]

Let $I = \{1, 2, 3\}$ denote a class index set. Assume that $k_1$ has
$p_1$ as its largest component, giving a high a posteriori probability
for class $C_1$. The magnitude and direction of the Bayesian distance
after $k_1$ are given by

\[ \mathrm{Mag}(1) = \frac{\sum_{i \in I} p_i^2}{\left(\sum_{j \in I} p_j\right)^2}, \qquad \mathrm{Dir}(1) = 1, \quad \text{assuming } p_1 > p_2 \ge p_3. \tag{5.4} \]

Suppose we calculate the a posteriori probabilities $\alpha_i$ of the
classes $C_1$, $C_2$ and $C_3$ using these two keywords. The expression is

\[ \alpha_i = \frac{p_i q_i}{\sum_{j \in I} p_j q_j}, \qquad i \in I. \]

Suppose $k_2$ does not have its components ordered in the same way
as $k_1$. That is, it is not true that $q_1 > q_2$ and $q_2 > q_3$. In
such a case, the largest $\alpha_i$ will depend on the products $p_1 q_1$,
$p_2 q_2$ and $p_3 q_3$. If $p_1 q_1$ is larger than both $p_2 q_2$ and
$p_3 q_3$, then $\alpha_1$ will be larger than $\alpha_2$ and $\alpha_3$. If now the
$q_1$ component of $k_2$ is increased, $k_2$ becomes more indicative of
class $C_1$.




As $q_1$ approaches the value of $p_1$, the Euclidean distance d shown in
Figure 5.3 decreases and thus the angle $\theta$ also decreases. This
causes the value of $p_1 q_1$ to increase, while $p_2 q_2$ and $p_3 q_3$ remain
the same. As a result $\alpha_1$ increases relative to $\alpha_2$ and $\alpha_3$. Thus
$k_1$ and $k_2$ together are now more indicative of class $C_1$. Now consider
a case where the angle $\theta$ is equal to zero, that is, vectors $k_1$ and
$k_2$ are coincident. In such a case the associated vector for $k_2$ can
be represented as a scalar multiple of $k_1$. Thus $q_1 = \gamma p_1$,
$q_2 = \gamma p_2$ and $q_3 = \gamma p_3$. If the a posteriori probabilities of
the classes are calculated now using $k_1$ and $k_2$, then we have

\[ \alpha_i = \frac{\gamma p_i^2}{\sum_{j \in I} \gamma p_j^2}, \qquad i \in I. \tag{5.6} \]

$\alpha_1$ will now be greater than $\alpha_2$ and $\alpha_3$ because $p_1$ is greater
than $p_2$ and $p_3$. That means that if a document were to be classified
based on just these two keywords then class $C_1$ should be chosen. The
magnitude of the Bayesian distance should register an increase and
the direction should point to class $C_1$. The new magnitude is given
by

\[ \mathrm{Mag}(2) = \frac{\sum_{i \in I} \gamma^2 p_i^4}{\left(\sum_{j \in I} \gamma p_j^2\right)^2}. \tag{5.7} \]

We note that in both equations (5.6) and (5.7), $\gamma$ can be cancelled.
Mag(2) in equation (5.7) should be greater than Mag(1) given
by equation (5.4). The direction is obviously equal to class $C_1$.
The fact that Mag(2) is greater than Mag(1) is not so obvious. The
following theorem proves this fact rigorously.

Theorem 5.1: Given two keywords of the form

\[ k_1 = [p_1, p_2, p_3], \qquad k_2 = [\gamma p_1, \gamma p_2, \gamma p_3], \]

let $M_1$ and $M_2$ denote the magnitudes of the Bayesian distance
calculated considering keyword $k_1$ only, and $k_1$ and $k_2$ together,
respectively. Then $M_2 > M_1$, i.e.,

\[ \frac{\sum_{i \in I} \gamma^2 p_i^4}{\left(\sum_{j \in I} \gamma p_j^2\right)^2} > \frac{\sum_{i \in I} p_i^2}{\left(\sum_{j \in I} p_j\right)^2}. \tag{5.8} \]

Proof: After cancellation of $\gamma$, the inequality can be reduced to

\[ \frac{\sum_{i \in I} p_i^4}{\left(\sum_{j \in I} p_j^2\right)^2} > \frac{\sum_{i \in I} p_i^2}{\left(\sum_{j \in I} p_j\right)^2}. \]

Expanding the denominators and omitting the limits of the summations
for simplicity, we have

\[ \frac{\sum p_i^4}{\sum p_i^4 + 2\sum_{i>j} p_i^2 p_j^2} > \frac{\sum p_i^2}{\sum p_i^2 + 2\sum_{i>j} p_i p_j}. \tag{5.9} \]

Expanding both sides we have

\[ (p_1^2 + p_2^2 + p_3^2)(p_2^2 p_1^2 + p_3^2 p_1^2 + p_3^2 p_2^2) < (p_1^4 + p_2^4 + p_3^4)(p_2 p_1 + p_3 p_1 + p_3 p_2). \]

Reducing and factorizing we have

\[ p_1 p_2 (p_1 - p_2)(p_2^3 - p_1^3) + p_2 p_3 (p_2 - p_3)(p_3^3 - p_2^3) + p_1 p_3 (p_1 - p_3)(p_3^3 - p_1^3) + p_1 p_2 p_3 (3 p_1 p_2 p_3 - p_1^3 - p_2^3 - p_3^3) < 0. \tag{5.10} \]

Assuming without loss of generality that $p_1 \ge p_2 \ge p_3$, we see that
the first three terms of the above inequality are all less than or
equal to zero. Therefore we have to show that
$Q = 3 p_1 p_2 p_3 - p_1^3 - p_2^3 - p_3^3 \le 0$.
If $p_1$, $p_2$ and $p_3$ are all equal then Q equals zero. If $p_1$ is
increased and $p_2$ and $p_3$ remain the same, then the value of Q
becomes less than zero. If we can prove that Q has no extremal
points as $p_1$ is increased, the inequality will have been established.
Differentiating and setting to zero we have

\[ \frac{\partial Q}{\partial p_1} = 3 p_2 p_3 - 3 p_1^2 = 0. \tag{5.11} \]

Now from equation (5.11) we find that

\[ p_1^2 = p_2 p_3. \]

This is impossible if $p_1 > p_2$ and $p_2 \ge p_3$. Therefore Q does not have
any extremal points, and the theorem is proven.
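
Theorem 5.1 can also be checked numerically. The sketch below is not a substitute for the proof; it simply draws random three-class vectors and scale factors and counts the cases in which the magnitude after the coincident second keyword falls below the magnitude after the first. Equal a priori class probabilities are assumed, and the sampling ranges, the seed and the function names are arbitrary choices made for this illustration.

```python
import random


def mag_single(p):
    """M1: magnitude using keyword k1 = p alone."""
    total = sum(p)
    return sum((x / total) ** 2 for x in p)


def mag_coincident(p, gamma):
    """M2: magnitude using k1 = p together with k2 = gamma * p."""
    products = [x * (gamma * x) for x in p]
    total = sum(products)
    return sum((x / total) ** 2 for x in products)


random.seed(0)
violations = 0
for _ in range(10000):
    p = sorted((random.uniform(0.01, 1.0) for _ in range(3)), reverse=True)
    gamma = random.uniform(0.1, 2.0)
    # A small tolerance guards against floating-point rounding.
    if mag_coincident(p, gamma) < mag_single(p) - 1e-12:
        violations += 1

print("violations:", violations)
```

The scale factor has no effect on the result, which mirrors the cancellation of the factor in equations (5.6) and (5.7).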

■ ' / . ' '\ . ' ^ - ■ . ■ ■ ; _ 

A stronger version of Theorem 5.1 will now be considered. It
was noted in section 5.2 that when a succession of good keywords
occur in a document, the Bayesian distance magnitude increases
monotonically and the direction remains constant. That is, if a
keyword occurs which has the highest probability component for a
particular class, followed by another keyword which has the highest
component for the same class, then our confidence in that class as
the correct class to which the document belongs increases. This
experimental phenomenon can be explained by the theorem which is
stated below.

Theorem 5.2: Given two keywords

\[ k_1 = [p_1, p_2, p_3] \quad \text{where } p_1 > p_2 \ge p_3, \]
\[ k_2 = [q_1, q_2, q_3] \quad \text{where } q_1 > q_2 \ge q_3, \]

let $M_1$ be the magnitude of the Bayesian distance calculated using
$k_1$, and $M_2$ be the magnitude calculated using $k_1$ and $k_2$. Then
$M_2 > M_1$, i.e.,

\[ \sum_{i \in I} \left( \frac{p_i q_i}{\sum_{j \in I} p_j q_j} \right)^2 > \sum_{i \in I} \left( \frac{p_i}{\sum_{j \in I} p_j} \right)^2. \]

The proof of Theorem 5.2 will be presented in Appendix C. The
appeal of this result lies in the fact that it is counter-intuitive.
Let us denote each term of the summation on the right hand side of
the inequality above by $a_i$, i.e.,

\[ a_i = \frac{p_i}{\sum_{j \in I} p_j}. \tag{5.12} \]

Similarly let $b_i$ denote each term on the left hand side, i.e.,

\[ b_i = \frac{p_i q_i}{\sum_{j \in I} p_j q_j}. \tag{5.13} \]

We note that

\[ \sum_{i \in I} a_i = 1 \quad \text{and} \quad \sum_{i \in I} b_i = 1. \]

$M_1$ and $M_2$, therefore, can now be written as

\[ M_1 = a_1^2 + a_2^2 + a_3^2 \tag{5.14} \]
\[ M_2 = b_1^2 + b_2^2 + b_3^2 \tag{5.15} \]

If the conditions $p_1 > p_2 \ge p_3$ and $q_1 > q_2 \ge q_3$ are satisfied
then the following inequalities are true:

\[ a_1 > a_2 \ge a_3, \qquad b_1 > b_2 \ge b_3, \qquad b_1 > a_1. \]

Since $b_1 > a_1$, $(b_2 + b_3)$ must be less than $(a_2 + a_3)$. The value
$(a_2^2 + a_3^2)$ may be increased by making $a_2$ large and $a_3$ small.
Similarly the sum $(b_2^2 + b_3^2)$ may be decreased by making $b_2$ equal
to $b_3$. It appears therefore that even though $b_1^2$ is greater than
$a_1^2$, $(b_2^2 + b_3^2)$ may be made sufficiently smaller than
$(a_2^2 + a_3^2)$ such that $M_2$ may be actually smaller than $M_1$. The
theorem states that such a situation cannot occur. In short this
theorem states that for the purposes of classification, if two
keywords are more indicative of one class than other classes, then
they are good keywords in relation to each other. The inequality
above assures us that in such a case the magnitude of the Bayesian
distance will increase and the direction will remain constant.

However, it should be noted here that the constraint imposed on
keyword $k_2$ for a given keyword $k_1$, such that $M_2$ may be greater than
$M_1$, represents only sufficient conditions. Let $k_1$, as before, be
given by

\[ k_1 = [p_1, p_2, p_3], \qquad p_1 > p_2 \ge p_3. \]

Let $k_2$ now be given by

\[ k_2 = [q_1, q_2, q_3], \]

where $q_1$ is no longer the largest component. We note that $k_2$ now
is not a good keyword in relation to $k_1$ because the index of its
highest probability component is different from that for $k_1$. Under
such circumstances can $M_2$ be greater than $M_1$? If the values of
$q_1$, $q_2$ and $q_3$ are such that $p_1 q_1 > p_2 q_2 \ge p_3 q_3$, then the
quantities $a_1$, $a_2$, $a_3$, $b_1$, $b_2$ and $b_3$ defined earlier will
still satisfy the inequalities

\[ a_1 > a_2 \ge a_3 \quad \text{and} \quad b_1 > b_2 \ge b_3. \]

Hence depending on the values of the $a_i$ and $b_i$ it is possible
that $(b_1^2 + b_2^2 + b_3^2)$ is greater than $(a_1^2 + a_2^2 + a_3^2)$. Thus
$M_2$ is greater than $M_1$. This can be illustrated better by an example.

Let $k_1$ and $k_2$ be given by

\[ k_1 = [0.5, 0.4, 0.1] \]
\[ k_2 = [0.41, 0.5, 0.00001] \]

Here we see that the index of the highest probability component for
$k_1$ is class $C_1$ and for $k_2$ it is $C_2$. Thus the constraints stated in
Theorem 5.2 are no longer satisfied. However if the Bayesian
distances are calculated, we see that Dir(1) = Dir(2) = 1 and that
Mag(2) is greater than Mag(1). Therefore $M_2 > M_1$.
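
The example can be reproduced with a short calculation. In the sketch below the first component of the second keyword is taken to be 0.41; that digit is damaged in the source, so the value is an assumption of this illustration, as are the equal a priori class probabilities and the function name.

```python
def mag_dir(vectors):
    """Return (magnitude, 1-based direction) of the Bayesian distance."""
    products = [1.0] * len(vectors[0])
    for vector in vectors:
        products = [a * b for a, b in zip(products, vector)]
    total = sum(products)
    alpha = [x / total for x in products]
    direction = max(range(len(alpha)), key=lambda j: alpha[j]) + 1
    return sum(a * a for a in alpha), direction


k1 = [0.5, 0.4, 0.1]
k2 = [0.41, 0.5, 0.00001]  # first component is an assumed reading

mag1, dir1 = mag_dir([k1])
mag2, dir2 = mag_dir([k1, k2])
print(round(mag1, 4), dir1)
print(round(mag2, 4), dir2)
```

With these values the direction stays at class 1 and the magnitude rises, even though the second keyword on its own favours class 2.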



It was pointed out earlier that a monotonically increasing
pattern in the magnitude of the Bayesian distance and a constant
direction could be utilised to recognise documents which are ideal
in nature. The discussion above illustrates that in addition to
these conditions, the index of the highest probability component
for each keyword should be checked to see whether they are all the
same. If they are, then the keywords can be considered to be good
keywords in relation to each other and the document can be considered
to be ideal in nature.

The two theorems stated above explain some of the experimental
phenomena observed in the earlier sections of this chapter. The
next chapter develops a classification algorithm based on these
observed phenomena. Keywords are extracted sequentially from a
document and at each stage the magnitude and direction of the
Bayesian distance are calculated. Changes in these quantities are
observed in order to isolate noisy keywords and identify clusters
of similar keywords. If after removal of the noisy keywords, a
monotonically increasing pattern is observed in the Bayesian
distance magnitudes, as in the case of an ideal document,
classification is attempted. A description of how such a method may
be implemented is given in the next chapter.
CHAPTER VI

THE REVISED SEQUENTIAL METHOD: PRIMARY CLASSIFICATION

The previous chapter showed how the Bayesian distance measure
could be used to detect the presence of noisy keywords and to
identify groups of good keywords in a document. The basic sequential
algorithm discussed in Chapter III can now be modified to achieve
classification by taking into account only the good keywords. This
chapter is devoted to the design of such an algorithm. The algorithm
works in two phases. In the first phase keywords are extracted
sequentially from a document and, based on the Bayesian distance
analysis, the keywords read are divided into two groups: the good
keywords and the noisy keywords. The first section of this chapter
discusses a method which achieves such a separation of the keywords.
When the good keywords are such as to meet the Q and S thresholds
discussed in the previous chapter, the algorithm analyzes these
words to obtain a primary class for the document. In the second
phase the noisy keywords are analysed to see whether these are
indicative of another class. If so then the document is classified
into a secondary class. This second phase of the algorithm will be
discussed in Chapter VII.
6.1 Separation of Good and Noisy Keywords

Each keyword extracted from a document has the form
$[p_1, p_2, \ldots, p_t]$ where $p_j$ denotes the probability $p(k/C_j)$.
These probability values are obtained from the conditional
probability matrix CP discussed in Chapter III. Besides a probability
vector of the form shown above, each keyword has also associated
with it an index I, representing its location in the document. The
first keyword read from a document has an index 1, the second
keyword has an index 2, and so on. Each stage of the sequential
process now consists of the following:

(i) Q keywords are read from a document, where the
    parameter Q represents a threshold. The
    magnitude has to increase monotonically over a
    sequence of Q keywords in order for a document
    to be classified in a primary class. This parameter
    was introduced and discussed in Chapter V.

(ii) The indices and the associated probability vectors
    for each of these Q keywords are stored in an
    input buffer.

In order to separate the noisy keywords from the good ones, two
auxiliary buffers are set up. Each of these buffers is capable of
storing the index of a keyword, where the index serves as a pointer
to an entry in the input buffer. The first auxiliary buffer will
be used to store the indices of the good keywords, and will be
called the good keyword buffer (GBUF). The second auxiliary buffer
will be used to contain the indices of the noisy keywords, and will
be referred to as the noisy keyword buffer (NBUF).

Initially let us assume that all the Q keywords read at any
stage of the sequential process are good keywords in that they are
all indicative of one class. Therefore the indices of these Q
keywords are tentatively loaded in the GBUF. Suppose
$W_1, W_2, \ldots, W_Q$ are the keywords whose indices are in the GBUF.
Each $W_i$ is of course a member of the keyword set
$K = \{k_1, k_2, \ldots, k_m\}$. In order to check whether a
monotonically increasing pattern in the magnitude of $D_B$ is obtained
over these Q keywords, the Bayesian distances are calculated using
keyword $W_1$, then keywords $W_1$ and $W_2$, and so on. The form of such
a monotonically increasing pattern was shown in Figure 5.1 of the
previous chapter. If at any stage a particular keyword causes the
magnitude to decrease or the direction of $D_B$ to change, it is tagged
as a noisy word. Its index is then removed from the GBUF and put in
the NBUF. For example, suppose four keywords $W_1$, $W_2$, $W_3$ and $W_4$
have been read. The indices 1, 2, 3 and 4 are tentatively loaded
into the GBUF. Further suppose that keywords $W_1$ and $W_2$ yield a
monotonically increasing pattern in the magnitude of the Bayesian
distance and a constant direction. It is then assumed that they are
good keywords in relation to each other. Now let keyword $W_3$ be such
that it causes a decrease in the magnitude. Then it is tentatively
tagged a noisy keyword and the index 3 is removed from the GBUF and
put in the NBUF. The situation at this stage is depicted by
Figure 6.1. The process of calculating the Bayesian distance is
repeated using keywords $W_1$, $W_2$ and $W_4$ to check whether $W_4$ is a
good or a noisy keyword. The following section describes how the
Bayesian distances are calculated and how the magnitudes and
directions are tested at each stage to identify the noisy keywords.

6.2 The Bayesian Distance Calculator and Noise Detector

Let Mag(i) and Dir(i) denote the magnitudes and directions
respectively, calculated at the end of the i-th keyword in the GBUF.
This is done by first calculating the $\alpha_j$ values using the first i
keywords in the GBUF:

\[ \alpha_j = \frac{\prod_{l=1}^{i} p(W_l/C_j)}{\sum_{m=1}^{t} \prod_{l=1}^{i} p(W_l/C_m)}, \qquad j = 1 \text{ to } t. \tag{6.1} \]

Mag(i) and Dir(i) are then calculated using these values as follows:

\[ \mathrm{Mag}(i) = \sum_{j=1}^{t} (\alpha_j)^2 \tag{6.2} \]

Dir(i) = index of the highest $\alpha_j$ value.

For each value of i greater than two a noise detection procedure is
implemented. If either of the following two conditions

(i) Mag(i) < Mag(i - 1), or
(ii) Dir(i) ≠ Dir(i - 1),

is detected then the i-th keyword in the GBUF is assumed to be noisy.
Its index is put in the NBUF and removed from the GBUF. The index
of the next keyword in the input buffer is then loaded into
the GBUF and the process is repeated. When the following conditions
are found:

(i) Mag(1) < Mag(2) < ... < Mag(Q),
(ii) Mag(Q) ≥ S, where S is the preset saturation
     value discussed in section 5.2, and
(iii) Dir(1) = Dir(2) = ... = Dir(Q),

then a monotonically increasing pattern in the magnitudes of the
Bayesian distances, with the direction remaining constant, is
obtained. At this point there are Q keywords in the GBUF which
are good keywords in relation to each other and which between them
indicate a unique class. This class, whose index is given by
Dir(Q), is identified as the primary class of the document. In
case one of the three conditions above is not satisfied and
the document contains no more keywords to be read, then it is
termed unclassifiable.
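
The noise detector and the two buffers can be sketched as follows. This is a simplified, assumption-laden illustration and not the implementation described in the report: keywords are processed one at a time instead of being loaded Q at a time, equal a priori class probabilities are assumed, the noise test is applied from the second keyword onward, and a document is accepted as soon as at least Q good keywords give a magnitude of S or more. The function and variable names are hypothetical.

```python
def mag_dir(vectors):
    """Return (magnitude, 1-based direction) of the Bayesian distance."""
    products = [1.0] * len(vectors[0])
    for vector in vectors:
        products = [a * b for a, b in zip(products, vector)]
    total = sum(products)
    alpha = [x / total for x in products]
    direction = max(range(len(alpha)), key=lambda j: alpha[j]) + 1
    return sum(a * a for a in alpha), direction


def separate(input_buffer, q, s):
    """Return (primary class or None, GBUF indices, NBUF indices)."""
    gbuf, nbuf = [], []
    previous = None
    for index in range(1, len(input_buffer) + 1):
        candidate = gbuf + [index]
        mag, direction = mag_dir([input_buffer[i - 1] for i in candidate])
        noisy = previous is not None and (
            mag < previous[0] or direction != previous[1]
        )
        if noisy:
            nbuf.append(index)  # tag as noisy and leave GBUF unchanged
            continue
        gbuf, previous = candidate, (mag, direction)
        if len(gbuf) >= q and mag >= s:
            return direction, gbuf, nbuf
    return None, gbuf, nbuf  # unclassifiable: keywords exhausted


# Hypothetical document: keyword 2 is noisy, the rest favour class 1.
doc = [[0.8, 0.1, 0.1], [0.3, 0.4, 0.3]] + [[0.8, 0.1, 0.1]] * 3
result = separate(doc, q=3, s=0.99)
print(result)
```

Keeping only indices in the two buffers, as the text describes, means the probability vectors are stored once in the input buffer and never copied.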

It has been shown that after reading each keyword a decision is made as to whether it should be considered good or not, depending on the direction and magnitude of the Bayesian distance. Based on this decision, the index of the word is either kept in the GBUF or is put in the NBUF. Since a keyword is good or noisy only in relation to the other keywords that have been put in the buffers, it is quite possible that at some stage of the sequential process the NBUF contains the good keywords which point to the correct primary class. This can be better illustrated by an example.

Suppose we have read a series of keywords W_1, W_2, W_3, W_4 and W_5. Then let us assume that the following sequence of actions has been taken:

     (i)   the index of W_1 is put in the GBUF,
     (ii)  W_2 is good in relation to W_1 and so its index is put in the GBUF, and
     (iii) W_3, W_4 and W_5 are noisy in relation to W_1 and W_2 and so their indices are put in the NBUF.

At this point the GBUF contains the indices of W_1 and W_2, while the NBUF contains the indices of W_3, W_4 and W_5. Suppose W_1 and W_2 are indicative of class C_i and W_3, W_4 and W_5 are all indicative of class C_j. If now the magnitude of D calculated by using W_3, W_4 and W_5 exceeds the value of the magnitude calculated by using W_1 and W_2 in the GBUF, it is highly likely that W_3, W_4 and W_5 are the set of good keywords which point to the correct primary class. If this is the case, then the indices of the words in the GBUF should be interchanged with those in the NBUF. To achieve this, at every stage when the number of words in the NBUF equals that in the GBUF, the Bayesian distances of the words in it are calculated. If a unique direction is obtained and if the magnitude exceeds the magnitude of the Bayesian distance calculated by using the words in the GBUF, then the two buffers are interchanged.
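The interchange criterion can be written as a small predicate. The sketch below is hypothetical: it assumes each buffer has already been passed through the Bayesian distance calculator, giving a list of (magnitude, direction) pairs, one pair per keyword added, and the name should_interchange is not taken from the original program.

```python
from typing import List, Tuple

Trace = List[Tuple[float, int]]  # (Mag, Dir) after each successive keyword


def should_interchange(gbuf_trace: Trace, nbuf_trace: Trace) -> bool:
    """Decide whether the NBUF and the GBUF should be interchanged."""
    # The check is made only when both buffers hold the same number of words.
    if not gbuf_trace or len(gbuf_trace) != len(nbuf_trace):
        return False
    # The NBUF words must all point in one direction ...
    unique_direction = len({d for _, d in nbuf_trace}) == 1
    # ... and their final magnitude must exceed that of the GBUF words.
    return unique_direction and nbuf_trace[-1][0] > gbuf_trace[-1][0]


# Made-up traces: the NBUF words agree on class 3 and reach a larger magnitude.
print(should_interchange([(0.20, 1), (0.35, 1)], [(0.40, 3), (0.70, 3)]))
```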
The concepts discussed in this section have been implemented as a primary classification algorithm. The next section briefly discusses the algorithm.


6.3 Description of the Primary Classification Algorithm

The first phase of the classification algorithm, which obtains a primary class for a test document, contains several parts, the most important of which are the primary classifier, the Bayesian distance calculator, and the noise detector. In this section we will briefly discuss these components of the algorithm. The portion of the algorithm which extracts keywords from a document will be described for the sake of completeness. A more comprehensive description will be given in Appendix D. A flowchart of the algorithm is given in Figure 6.2.



6.3.1 The Keyword Extractor

This portion of the algorithm reads one keyword at a time from a given test document and stores the following in an input buffer:

     (i)   a value i corresponding to the index of the keyword read,
     (ii)  the keyword, and
     (iii) the probability values associated with the keyword. These are stored in a matrix P(i,j), where i represents the index of the keyword and j represents the index of a category.

6.3.2 The Bayesian Distance Calculator

The input to this portion of the algorithm consists of a set of indices S_I, which comprises either the set of indices of the keywords in the GBUF or the set of indices of the keywords in the NBUF. Let this set at any given stage be S_I = {i_1, i_2, ..., i_n}. Then the Bayesian distance calculator computes the magnitudes and directions of D considering the keyword W_i1, then W_i1 and W_i2, then W_i1, W_i2 and W_i3, and so on. The magnitudes and directions

     Mag(i_1), Mag(i_2), ..., Mag(i_n) and Dir(i_1), Dir(i_2), ..., Dir(i_n)

are stored in an array for analysis by the noise detector.



6.3.3 The Noise Detector

Using the array of magnitudes and directions calculated by the Bayesian distance calculator, the noise detector checks for a monotonically increasing pattern in the magnitudes. If this is satisfied it checks to see whether the directions are all the same. Suppose in the array of magnitudes and directions one or both of the following two conditions,

     (i)  Mag(i) < Mag(i - 1)
     (ii) Dir(i) ≠ Dir(i - 1)

is detected; then the i-th keyword is identified as a noisy keyword.

6.3.4 The Primary Classifier

The input to this section of the algorithm consists of the array of magnitudes and directions calculated by the Bayesian distance calculator and the indices of the noisy keywords computed by the noise detector. The parameters Q and S discussed earlier govern the operation of this section of the algorithm. The primary classifier performs the following functions.

     (i)   It loads the set S_I with a set of Q indices from the GBUF and uses the Bayesian distance calculator to compute an array of magnitudes and directions.

     (ii)  It uses the noise detector to identify a noisy keyword in the set of keywords obtained from the GBUF.

     (iii) If the noise detector identifies a noisy keyword, it removes the index of this keyword from the GBUF and places it in the NBUF. It then uses the keyword extractor algorithm to obtain an additional keyword. If there is such a keyword, its index is loaded into the GBUF. If there are no more keywords in the document, then it identifies the document as being unclassifiable.

     (iv)  Each time the index of a word is loaded into the NBUF, it checks to see whether the size of the NBUF equals that of the GBUF. If so, then it checks to see whether, based on the Bayesian distance values, the contents of the NBUF should be interchanged with those of the GBUF. The criterion which governs such an action has been discussed in section 6.2.

     (v)   If the noise detector does not identify a noisy keyword among the Q keywords present in the GBUF, the primary classifier checks to see whether Mag(Q) ≥ S, where S is the saturation value. If so, then the class given by Dir(Q) is identified as the primary class for the document. If Mag(Q) < S, then an additional keyword is read. If the document does not contain any more keywords, the primary classifier still classifies it into the class given by Dir(Q) but records the fact that the S value was not satisfied. This is to signify that the confidence in the primary class obtained for this document is less than that for documents which have satisfied the S threshold.
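Steps (i) to (v) can be strung together as a loop. The following Python sketch is one possible reading of those steps and not the program described in Appendix D. The Bayesian distance itself is not reproduced here: toy_trace is a made-up stand-in that only mimics its interface, TOY is hypothetical data, and it is assumed that when Mag(Q) falls short of S the additional keyword is appended to the GBUF.

```python
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple

Trace = List[Tuple[float, int]]  # (Mag, Dir) after each successive keyword

# Hypothetical toy data: keyword index -> (class it favours, strength).
TOY: Dict[int, Tuple[int, float]] = {1: (0, 0.5), 2: (0, 1.0), 3: (1, 0.3), 4: (0, 8.0)}


def toy_trace(indices: Sequence[int]) -> Trace:
    """Stand-in for the Bayesian distance calculator (NOT the real measure)."""
    totals: Dict[int, float] = {}
    trace: Trace = []
    for k in indices:
        cls, weight = TOY[k]
        totals[cls] = totals.get(cls, 0.0) + weight
        best = max(totals, key=lambda c: totals[c])
        trace.append((totals[best] / (1.0 + sum(totals.values())), best))
    return trace


def noisy_position(trace: Trace) -> Optional[int]:
    """0-based position of the first keyword that breaks the pattern, if any."""
    for k in range(1, len(trace)):
        if trace[k][0] < trace[k - 1][0] or trace[k][1] != trace[k - 1][1]:
            return k
    return None


def classify_primary(keywords: Iterable[int], q: int, s: float,
                     distance_trace: Callable[[Sequence[int]], Trace]):
    """Return (primary class or None, S satisfied?, GBUF, NBUF)."""
    stream = iter(keywords)
    gbuf: List[int] = []
    nbuf: List[int] = []
    while True:
        while len(gbuf) < q:                      # step (i): keep Q indices in the GBUF
            nxt = next(stream, None)
            if nxt is None:
                return None, False, gbuf, nbuf    # step (iii): unclassifiable
            gbuf.append(nxt)
        trace = distance_trace(gbuf)
        pos = noisy_position(trace)               # step (ii)
        if pos is not None:
            nbuf.append(gbuf.pop(pos))            # step (iii): move the noisy index
            if len(nbuf) == len(gbuf):            # step (iv): possible interchange
                n_tr, g_tr = distance_trace(nbuf), distance_trace(gbuf)
                if len({d for _, d in n_tr}) == 1 and n_tr[-1][0] > g_tr[-1][0]:
                    gbuf, nbuf = nbuf, gbuf
            continue
        mag, direction = trace[-1]                # step (v)
        if mag >= s:
            return direction, True, gbuf, nbuf
        nxt = next(stream, None)
        if nxt is None:
            return direction, False, gbuf, nbuf   # classified, S not satisfied
        gbuf.append(nxt)


print(classify_primary([1, 2, 3, 4], q=3, s=0.8, distance_trace=toy_trace))
```

With the toy data, keyword 3 lowers the magnitude and is moved to the NBUF, keyword 4 is read in its place, and the document is assigned class 0 with the S threshold satisfied.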

This algorithm has been implemented to classify approximately 500 documents contained in one of the releases of the SPIN data base. A brief description of this data was given in Chapter III. The following section presents the results that have been obtained.



6.4 Implementation of the Algorithm and Results of Experiments

Since each of the SPIN documents has been preclassified by the American Institute of Physics (AIP), an estimate of how well the algorithm has performed could be easily obtained. This was done by comparing the classification obtained by the algorithm with that given by the AIP.

As pointed out in section 6.3, the algorithm has two parameters, Q and S. The parameter S denotes the final value of the Bayesian distance at the point when a document is classified. It represents a measure of confidence in the classification that is obtained. The parameter Q represents the number of keywords over which the Bayesian distance has to satisfy a monotonically increasing pattern in the magnitude, while maintaining a constant direction. Initially S was varied from 0.8 to 1.0 for different values of Q, and in all these cases it was found that if a document satisfied the Q threshold then the final value of the Bayesian distance, i.e., Mag(Q), always resided in the interval 0.9 to 1.0. Variation of S, therefore, did not produce any significant change in the results and hence a fixed value of S = 0.9 was used. Before discussing the results of these experiments further, several terms need to be defined. Let

     D = number of documents correctly classified,
     T = total number of documents in the data base,
     U = number of documents found unclassifiable.

Then

\[ A_1 = \text{accuracy over those classified} = \frac{D}{T - U} \times 100, \]
\[ A_2 = \text{overall accuracy} = \frac{D}{T} \times 100. \]
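As a small worked illustration of the two accuracy measures, the sketch below assumes that A_1 is taken over the documents that were actually classified (T minus U) and that A_2 is taken over all T documents. The counts in the example are invented for illustration and are not results from the SPIN experiments.

```python
from typing import Tuple


def accuracies(d: int, t: int, u: int) -> Tuple[float, float]:
    """Return (A1, A2) in percent.

    d: documents correctly classified
    t: total documents in the data base
    u: documents found unclassifiable
    """
    if t <= 0 or t - u <= 0:
        raise ValueError("need at least one classified document")
    a1 = 100.0 * d / (t - u)   # accuracy over those classified
    a2 = 100.0 * d / t         # overall accuracy
    return a1, a2


# Hypothetical counts: 500 documents, 100 unclassifiable, 300 correct.
print(accuracies(300, 500, 100))
```

Because unclassifiable documents are excluded from the denominator of A_1 but not of A_2, raising the number of unclassifiable documents can raise A_1 while lowering A_2.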

The first set of experiments conducted used a fixed value of Q. For values of Q = 2, 3 and 4 the results are presented in Table 6.1. As can be expected, an increase in Q results in higher values for A_1. This is because as Q increases, more keywords are examined before a document is classified. Thus for Q = 2 only two good keywords are needed before a primary class is identified. In many cases this might lead to a precipitous decision. Keywords occurring in the rest of the document may point to an entirely different class which in many cases may be the correct class. An increase in Q avoids such precipitous decisions and hence increases the value of A_1.

However, an increase in Q means a document has to have a greater number of good keywords in order to be classified. Fewer documents are able to satisfy this more stringent criterion. Hence the number of unclassified documents increases and the value of overall accuracy (A_2) decreases.

Some of the unclassified documents are such that they contain a set of good keywords indicative of a unique class. They are identified as unclassifiable by the algorithm because the number of good keywords contained in them is not high enough to satisfy a fixed Q threshold. Therefore in the next set of experiments, the value of Q was dynamically varied during classification. Table 6.2 presents the results of these experiments. If a particular document did not satisfy a given threshold then the value of Q was reduced by one and the document was reconsidered for classification under the following conditions:

     (i)  the document has been read entirely, and
     (ii) no noisy keyword has been identified in the document.

Table 6.1 Primary Classification with Fixed Q

Table 6.2 Primary Classification with Dynamic Q

For example, suppose the starting value of Q is three and a test document contains only two keywords, both of which are good keywords in relation to each other, i.e., they are both indicative of one class. Then the value of Q would be reduced to two. The primary classifier would then attempt to identify a primary class for the document based on these two keywords.
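The dynamic reduction of Q can be pictured as a retry loop around a fixed-Q classifier. In the hypothetical Python sketch below, classify_fixed stands for any routine that attempts classification at a given Q and reports whether the document was read entirely and whether a noisy keyword was found. The lower bound q_min is an assumption, since the text does not state where the reduction stops.

```python
from typing import Callable, Optional, Tuple

# classify_fixed(q) -> (class or None, document read entirely?, noisy keyword found?)
FixedQ = Callable[[int], Tuple[Optional[int], bool, bool]]


def classify_dynamic_q(classify_fixed: FixedQ, q_start: int,
                       q_min: int = 1) -> Tuple[Optional[int], int]:
    """Retry with Q reduced by one while conditions (i) and (ii) allow it."""
    q = q_start
    while True:
        cls, fully_read, noisy_found = classify_fixed(q)
        if cls is not None:
            return cls, q
        # Q is reduced only if the document was read entirely and no noisy
        # keyword was identified in it.
        if q <= q_min or not fully_read or noisy_found:
            return None, q
        q -= 1


def two_good_keywords(q: int) -> Tuple[Optional[int], bool, bool]:
    """Toy document holding exactly two good keywords of (hypothetical) class 7."""
    return (7 if q <= 2 else None), True, False


print(classify_dynamic_q(two_good_keywords, q_start=3))
```

This mirrors the example in the text: starting from Q = 3, the two-keyword document fails, Q drops to two, and a primary class is then obtained.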

The results presented in Table 6.2 show that some of the previously unclassified documents are now correctly classified. Therefore the value of overall accuracy increases. However, because of the reduction in Q, some of the previously unclassified documents are now incorrectly classified. Therefore the value of A_1 tends to decrease.

6.5 Secondary Classification

The discussion in this chapter has dealt with the design and implementation of an algorithm which obtains primary classes for a set of documents. A section of this algorithm separates the noisy keywords from the good keywords in a given document. The noisy keywords are put in a separate buffer called the NBUF. In effect this means that at the end of primary classification two different groups of keywords are obtained. The good keyword group is analyzed to obtain a primary class. But what about the group of keywords that have been tagged as noisy? If the contents of a document are such that a secondary class could be identified, then it is very likely that the keywords that have been isolated as noisy may actually be indicative of it. If this is the case, then the words in the NBUF could be analyzed to obtain a possible secondary class. The next chapter addresses itself to this problem and outlines an algorithm for obtaining secondary classification.

A TECHNIQUE FOR SECONDARY CLASSIFICATION

The previous chapter has dealt with the design and implementation of a classification algorithm based on the Bayesian distance measure. This algorithm classifies a document into a class which is denoted as the primary class. Besides obtaining a primary class, an added feature of this technique is that it effects a separation of the good keywords and noisy keywords into two different groups. At the end of the primary classification the noisy keywords are contained in the noisy keyword buffer (NBUF). These words may be unrelated to each other in that they might not point to any one class, or they may form a coherent cluster in such a way that between them they are indicative of a category which may be identified as a secondary class. This chapter addresses itself to the design of a method that will analyze the words in the NBUF to explore the possibility of classifying a document into a secondary class. It will be pointed out that in many cases the keywords in the NBUF alone are not adequate to obtain such a classification. In these cases keywords are extracted from the good keyword buffer to corroborate the information obtained from the words in the noisy buffer.

7.1 Analysis of Keywords in the Noisy Buffer

At the end of primary classification the NBUF contains the indices of the keywords which have been identified as noisy keywords by primary classification. Let this set of noisy keywords be denoted

     N = {W_1, W_2, ..., W_n}.

The philosophy underlying the analysis of the words in the NBUF is the same as that used for the analysis of the words in the GBUF. The values of the magnitudes and directions of the Bayesian distance are successively calculated using word W_1, then W_1 and W_2, and so on. Let Mag(i) and Dir(i) represent the magnitude and direction, respectively, calculated by using the first i keywords in the NBUF. For each value of i the noise detector discussed in section 6.3 of the previous chapter is invoked to check whether a constant direction and a monotonically increasing pattern in the magnitudes is obtained. As in the primary classification algorithm, the number of keywords over which a monotonically increasing pattern has to be satisfied is denoted by Q. If the noise detector detects one of the following situations,

     (i)  Mag(i - 1) > Mag(i), or
     (ii) Dir(i - 1) ≠ Dir(i),

then the i-th keyword in the NBUF is considered to be noisy with respect to the i - 1 keywords preceding it. The index of this keyword is then removed from the NBUF and placed in a miscellaneous buffer referred to as the MBUF. The Bayesian distance is again computed for the remaining words in the NBUF to check whether there are any more noisy keywords. At the end of this process, the NBUF contains a set of keywords which are all good with respect to each other in that they are all indicative of one class. Let these keywords be denoted by

     N' = {W_1, W_2, ..., W_r}.

If there are Q such good keywords in the NBUF, i.e., if r ≥ Q, and if Mag(Q) ≥ S, then a secondary class is identified as Dir(Q).
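The repeated filtering of the NBUF into the MBUF can be sketched as below. This is a hypothetical illustration only: toy_trace is a made-up stand-in with the same interface as the Bayesian distance calculator (it is not the real measure), TOY is invented data, and the function name secondary_from_nbuf does not come from the original program.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Trace = List[Tuple[float, int]]  # (Mag, Dir) after each successive keyword

# Hypothetical toy data: keyword index -> (class it favours, strength).
TOY: Dict[int, Tuple[int, float]] = {3: (1, 1.0), 5: (2, 0.2), 6: (1, 3.0)}


def toy_trace(indices: Sequence[int]) -> Trace:
    """Stand-in for the Bayesian distance calculator (NOT the real measure)."""
    totals: Dict[int, float] = {}
    trace: Trace = []
    for k in indices:
        cls, weight = TOY[k]
        totals[cls] = totals.get(cls, 0.0) + weight
        best = max(totals, key=lambda c: totals[c])
        trace.append((totals[best] / (1.0 + sum(totals.values())), best))
    return trace


def secondary_from_nbuf(nbuf: Sequence[int], q: int, s: float,
                        distance_trace: Callable[[Sequence[int]], Trace]):
    """Return (secondary class or None, filtered NBUF, MBUF)."""
    kept: List[int] = list(nbuf)
    mbuf: List[int] = []
    while True:
        trace = distance_trace(kept)
        noisy: Optional[int] = None
        for k in range(1, len(trace)):
            if trace[k][0] < trace[k - 1][0] or trace[k][1] != trace[k - 1][1]:
                noisy = k
                break
        if noisy is None:
            break
        mbuf.append(kept.pop(noisy))   # move the noisy index to the MBUF, then recompute
    # At least Q mutually good keywords are needed, and Mag(Q) must reach S.
    if len(kept) >= q and trace[q - 1][0] >= s:
        return trace[q - 1][1], kept, mbuf
    return None, kept, mbuf


print(secondary_from_nbuf([3, 5, 6], q=2, s=0.75, distance_trace=toy_trace))
```

Here keyword 5 is moved to the MBUF, and the remaining two keywords agree on class 1 with a final magnitude above the threshold.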

One problem that is encountered by this technique is that after the noisy keywords have been eliminated from the NBUF the number of keywords remaining in it may not be sufficient to satisfy the Q threshold. Under such circumstances, this technique would not attempt to obtain a secondary class. This might be very restrictive. One way to circumvent this problem would be to selectively choose keywords from the good keyword buffer to corroborate the information obtained from analyzing the words in the NBUF. The next section outlines a method for doing this.

7.2 Analysis of the Words in the Good Buffer

Let us assume that the keywords in the NBUF do not satisfy the Q threshold but yield a constant direction and a monotonically increasing pattern in the D magnitude. If this fixed direction is that of class C_j, then C_j is treated as a potential secondary class. Let us assume that category C_i has been chosen as the primary class. The words in the GBUF are now extracted and, along with their associated probability vectors from the input buffer, are stored in a matrix as shown in Figure 7.1. Column j in this matrix contains the probability values corresponding to class C_j. Any keyword which has a non-default probability value in this column is a keyword which might be used to provide information about class C_j. Therefore, any keyword which has a non-default probability value in column j is isolated from this matrix. Such a keyword will have the form

     [p_1, ..., p_i, ..., p_j, ..., p_t],

where p_j is greater than the default value. Since this keyword is a good keyword for the primary class, p_i is also greater than the default value. Let the set of these keywords be denoted by G = {W_1, W_2, ..., W_m}. Each of these keywords is indicative of classes C_i and C_j. Since, however, each of these keywords is a good keyword for the primary class, in general the probability component p_i should be greater than the p_j component. That is, all these keywords are probably more indicative of class C_i than of class C_j. A measure of how strongly indicative they are of class C_j can be obtained as follows.

Figure 7.1 Keywords in GBUF and their Associated Probability Vectors

Without loss of generality, let us assume that only one keyword W in G were used to classify the document. The a posteriori probabilities of the classes, the α values, can be computed by equation (7.1):

\[ \alpha_j = \frac{p_j}{\sum_{k=1}^{t} p_k}, \qquad j = 1 \text{ to } t. \qquad (7.1) \]

Since this is a good keyword for class C_i, α_i is the maximum of the α_j values. A measure of how strongly W relates the document to class C_j can be obtained by computing the ratio r, where

\[ r = \frac{\alpha_j}{\alpha_i} = \frac{p_j}{p_i}. \qquad (7.2) \]

Thus while choosing keywords from G to corroborate the information given by the keywords in the NBUF, this ratio may be compared with a preset threshold S_r. A keyword is chosen only if the corresponding ratio exceeds S_r. The parameter S_r may be varied to extract keywords from G very selectively, in that if S_r is increased the keywords that are chosen will be more highly indicative of class C_j. For a given classification experiment, S_r is fixed at a certain value. Let G' = {W_1, W_2, ..., W_k} be the set of keywords which satisfy the threshold. G' is then merged with the set N' (see section 7.1) of keywords present in the NBUF. Since the primary class C_i has already been chosen, the i-th probability component of all these keywords is set to the default value. If there are at least Q keywords in the merged set G' ∪ N', then the indices of these keywords are loaded into the NBUF and the procedure outlined in section 7.1 is repeated to determine whether C_j can be chosen as a secondary class. If not, then it is assumed that a secondary class does not exist.
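The selection of corroborating keywords from the GBUF can be sketched as follows. The sketch assumes that the ratio compared with the threshold reduces to p_j / p_i, which holds if the a posteriori values are simply the normalized probability components. The keyword names, probability vectors, default value and function names are all hypothetical.

```python
from typing import Dict, List


def select_corroborating(gbuf_vectors: Dict[str, List[float]], i: int, j: int,
                         threshold: float, default: float) -> Dict[str, List[float]]:
    """Pick GBUF keywords whose ratio p_j / p_i exceeds the threshold.

    i: index of the primary class, j: index of the potential secondary class.
    The i-th component of each chosen vector is reset to the default value,
    since the primary class has already been chosen.
    """
    chosen: Dict[str, List[float]] = {}
    for word, p in gbuf_vectors.items():
        if p[j] <= default or p[i] <= 0.0:
            continue                      # carries no information about class j
        if p[j] / p[i] > threshold:
            vec = list(p)
            vec[i] = default
            chosen[word] = vec
    return chosen


def enough_keywords(chosen: Dict[str, List[float]], n_prime: List[str], q: int) -> bool:
    """True if the merged set of chosen and NBUF keywords holds at least Q keywords."""
    return len(set(chosen) | set(n_prime)) >= q


# Hypothetical GBUF contents: three classes, default probability 0.01.
GBUF = {
    "laser":  [0.60, 0.30, 0.01],
    "plasma": [0.70, 0.01, 0.20],
    "optics": [0.50, 0.45, 0.01],
}
print(select_corroborating(GBUF, i=0, j=1, threshold=0.8, default=0.01))
```

Raising the threshold keeps only keywords that are nearly as indicative of the potential secondary class as of the primary class, which is why fewer keywords survive at higher threshold values.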

7.3 Results of Experiments

The first set of experiments conducted to obtain secondary classification concentrated exclusively on the words in the noisy buffer after the primary class has been obtained. The method has been outlined in section 7.1. Initially the conditions imposed for secondary classification were as stringent as those for the primary case, i.e., a value of S = 0.9 and Q = 3 was used. The results, shown in Table 7.1, indicate that very few documents were classified in a secondary class. A reduction of Q from three to two substantially increased this number. Some of the documents that were classified by the algorithm were also assigned a secondary class by the AIP. Table 7.1 indicates that all of these documents were assigned the correct secondary class by the algorithm.

In order to increase the number of documents which could be classified in a secondary class by the algorithm, a second set of experiments was conducted by combining the words in the NBUF with selected words from the GBUF. A method for doing this was discussed in section 7.2. If the NBUF did not have enough keywords to satisfy a threshold of Q = 3, then Q was reduced to two by the method discussed in the previous chapter. Table 7.2 presents the results obtained by varying the parameter S_r from 0 to 0.8. As the value of S_r increases, fewer documents are classified into a secondary class by the algorithm. This is because as S_r is increased, keywords that are extracted from the GBUF have to be more indicative of only the potential secondary class in order to be accepted for consideration. Since fewer keywords can satisfy the threshold, there exist a smaller number of documents having an adequate number of keywords to satisfy the Q threshold. We note that Tables 7.1 and 7.2 do not have entries for classification accuracy as in the case of primary classification. This is because determination of accuracy for secondary classification is not as straightforward as it was in the case of primary classification. This problem is discussed in the next section.

Number of SPIN documents classified in a secondary class by the American Institute of Physics (AIP): 101

7.4 Validation of Results Obtained by the Secondary Classification Algorithm

In order to evaluate the performance of the secondary classification algorithm, it is necessary to compare the results obtained with those given by a standard classification scheme. In the case of primary classification this was done by comparing the results with the classes given by the American Institute of Physics (AIP). In the case of secondary classification this has not been possible because AIP has classified only 103 of the 500 SPIN documents into a secondary class. In many cases the Bayesian distance algorithm assigns a secondary class to a larger number of documents. The situation can be best depicted by the Venn diagram shown in Figure 7.2, where D_a represents the set of documents for which a secondary class has been assigned by the AIP, and D_b represents the set of documents for which a secondary class has been assigned by the Bayesian distance algorithm. For the set of documents D_a ∩ D_b, i.e., those that have been assigned a secondary class by both schemes, comparison is straightforward. Such a comparison can be found in the third and fourth columns of Tables 7.1 and 7.2. The third column represents the set D_a ∩ D_b and the fourth column represents the number of documents in D_a ∩ D_b that have been correctly classified by the algorithm.

The problem is to obtain a method by which classifications obtained for documents lying in the non-intersection regions of these two sets can be compared. For this purpose the results need to be examined from a different viewpoint. The following two experiments have been conducted to compare the classifications.

     (i)  The document sets D_a and D_b are listed along with their primary and secondary classes as shown in lists L_a and L_b below.

               List L_a                            List L_b
          D_a    primary   secondary         D_b    primary   secondary
                 class     class                    class     class
          d_a1   C_i1      C_j1              d_b1   C_p1      C_q1
          ...    ...       ...               ...    ...       ...
          d_an   C_in      C_jn              d_bm   C_pm      C_qm

     (ii) For each pair (C_i, C_j), the set of documents that have been assigned to both these classes is obtained from list L_a. Based on the keywords contained in these documents, a standard class correlation matrix S_a is computed as follows. Let

          N_i  = total number of different keywords contained in documents assigned to class C_i,
          N_j  = total number of different keywords contained in documents assigned to class C_j,
          N_ij = total number of different keywords contained in documents assigned to both classes C_i and C_j.
          Then

\[ S_a(i,j) = \frac{N_{ij}}{N_i + N_j - N_{ij}}. \qquad (7.3) \]

     (iii) The procedure outlined in (ii) is repeated for list L_b, from which a correlation matrix S_b is obtained.

Since S_a is the class correlation matrix obtained from the AIP classification, and S_b is obtained from the results given by the Bayesian distance algorithm, a comparison of the two matrices should give an indication of how well the Bayesian distance classification has performed. This comparison is done by computing the mean square difference between the matrices S_a and S_b using equation (7.4). We note that S_a and S_b are symmetric matrices.

\[ \mathrm{MSQ} = \frac{2}{t(t-1)} \sum_{i=1}^{t-1} \sum_{j=i+1}^{t} \left[ S_a(i,j) - S_b(i,j) \right]^2 \qquad (7.4) \]

The smaller the value of the mean square difference, the more nearly equal will be the matrices S_a and S_b. Hence the two classifications will be more similar to each other.
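The mean square difference between two symmetric class correlation matrices can be computed over the upper triangle only. The sketch below assumes the average is taken over the t(t - 1)/2 distinct class pairs, which is the reading of the formula adopted here. The two 3 x 3 matrices are invented for illustration.

```python
from typing import List, Sequence


def mean_square_difference(sa: Sequence[Sequence[float]],
                           sb: Sequence[Sequence[float]]) -> float:
    """Mean of squared differences over the off-diagonal pairs i < j."""
    t = len(sa)
    if t < 2 or len(sb) != t:
        raise ValueError("matrices must be the same size with t >= 2")
    total = 0.0
    for i in range(t - 1):
        for j in range(i + 1, t):
            total += (sa[i][j] - sb[i][j]) ** 2
    return 2.0 * total / (t * (t - 1))


# Made-up symmetric correlation matrices for three classes.
SA: List[List[float]] = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.1], [0.2, 0.1, 1.0]]
SB: List[List[float]] = [[1.0, 0.3, 0.2], [0.3, 1.0, 0.4], [0.2, 0.4, 1.0]]
print(mean_square_difference(SA, SB))
```

Identical matrices give a value of zero, so smaller values indicate closer agreement between the two classifications.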

The next step is to relate the mean square difference between the correlation matrices to classification accuracy. We note that in the case of primary classification, accuracy could be directly obtained by comparing the class indicated by the Bayesian distance method with that assigned by the AIP for each document. This information can be used to relate mean square difference to classification accuracy in the following way.

For each of the experiments conducted for primary classification (see Tables 6.1 and 6.2 of the previous chapter), obtain the class correlation matrix S_b. Then compute the mean square difference between S_a and S_b. Let the range of variation of the mean square differences be N_1 to N_2. Let the range of variation of classification accuracy for these experiments be A_1 to A_2. Then it is conjectured that for a given secondary classification obtained by the Bayesian distance algorithm, if the mean square difference between S_a and S_b lies within the range N_1 and N_2, then the classification accuracy also lies approximately within the range A_1 and A_2.

The mean square differences for each of the secondary classifications obtained for values of S_r = 0, 0.2, 0.5 and 0.8 are given in Table 7.3. Similarly the mean square differences for a set of four primary classification experiments with known accuracy of classification are given in Table 7.4. From Table 7.4 we see that as the classification accuracy decreases the mean square difference increases, ranging from 0.00352 to 0.00950. From Table 7.3 it is seen that for values of S_r = 0.2, 0.5 and 0.8, the mean square differences for secondary classification lie within this range. It is therefore conjectured that the classification accuracy for these experiments lies between 61% and 79%. For S_r = 0, it is seen that the mean square difference is much higher and lies outside the range 0.00352 to 0.00950. It is therefore conjectured that the accuracy for this run is probably less than 60%. To validate these conjectures, another experiment was conducted involving manual classification of documents into a secondary class.



Table 7.3 Mean Square Difference for Secondary Classification

     S_r      MEAN SQUARE DIFFERENCE
              BETWEEN S_a AND S_b
     0.0      0.01100
     0.2      0.00711
     0.5      0.00477
     0.8      0.00373

Table 7.4 Mean Square Difference for Primary Classification

     RUN      CLASSIFICATION      MEAN SQUARE DIFFERENCE
     NO.      ACCURACY            BETWEEN S_a AND S_b
     1        61.7%               0.00950
     2        69.7%               0.00669
     3        77.9%               0.00474
     4        79.1%               0.00352

Experiment 2

Two individuals well versed in the area of physics were asked to read each of the 500 documents in the SPIN data base. They were apprised of the primary class assigned by the AIP in each case. They were asked to assign a secondary class whenever they felt that such an assignment would be appropriate. The results obtained for each of the secondary classification experiments were then compared with their classification results to determine classification accuracy. A document was considered to be correctly classified in a secondary class if this class matched that assigned by either of the two individuals. Otherwise it was considered to be incorrectly classified. Classification accuracy was then computed as follows. Let

     D_t = number of documents classified by the algorithm,
     D_c = number of documents correctly classified.

Then

\[ \text{Accuracy} = \frac{D_c}{D_t} \times 100. \]

The results are shown in Table 7.5.
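As a quick arithmetic check of this accuracy measure, the short sketch below takes the ratio of correctly classified documents to documents classified by the algorithm and applies it to two pairs of counts that are legible in Table 7.5 (98 of 211 and 101 of 163). The function name is hypothetical.

```python
def secondary_accuracy(correct: int, classified: int) -> float:
    """Percentage of classified documents that were classified correctly."""
    if classified <= 0:
        raise ValueError("no documents were classified")
    return 100.0 * correct / classified


# Counts as read from Table 7.5.
for correct, classified in [(98, 211), (101, 163)]:
    print(classified, correct, round(secondary_accuracy(correct, classified), 1))
```

The second pair reproduces the 62.0% entry of the table, and the first pair gives a value well below 60%.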



Table 7.5 Accuracy Based on Manual Classif ication 





TOTAL 
' ■ ' KUlffiER op" * 

DOCU^ffiNTSj . 
, / CLASSIPIEp ■ 


NUJffiEl^ OF 
DOCUMENTS • 
CORRECTLY 
CLASSIPIE5 


ACCURACY 


0 


211 


■■:/v.98 ' 


' k6M ■ ■ 


0.2 


163 


101 ' ' 


62.0^ 


0.5 


'llT , 




. ■ 63.2^ 




73' 




65.8^ 
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These- resiilts co:^o^orat8 the conjectures of Qcperiment 1 fairly 
well, it is seen that for the -values of - 0,2, 0,5 and 0.8 ^ the ^ 
classiflGation accuracy is hetter than 6o|, For p 0,; the. aeeuracy 
is substaiitially lower as had been predicted "by &periment\lV ' 

TM riesults of the above ej^erimenta show, t hit the Bayeslan- 
distance. technicLue can indeed be used effectively for secondary 
classification as well as for primary classification. 
This research has concentrated on the design of an efficient method for automatic sequential classification of documents. It has been pointed out that one of the main advantages of such a technique is that a document need not be examined in its entirety before a decision regarding its class membership can be made.

The basis of the research has been a sequential classification algorithm developed by Fried and implemented for various data bases by White and coworkers. In this technique keywords are extracted sequentially from a document and at each stage a statistical prediction technique is used to determine whether or not the document can be classified. If not, then the document is examined further. The process is continued until a definite decision can be reached. It has been shown that this basic sequential technique is vulnerable to the occurrence of noisy keywords and is not sophisticated enough to assign a document to more than one class systematically. Therefore the major part of this research has dealt with the development of a modified sequential algorithm which is able to isolate noisy keywords during classification and to identify clusters of similar keywords. These keyword clusters are then analyzed separately to obtain primary and secondary classes for a document.

The basis of the modified sequential technique is a t-dimensional vector space representation of the keywords contained in a document. It has been shown that using such a representation, the relationship between the keywords can be observed very systematically by defining a distance measure on this vector space. This distance measure, called the Bayesian distance, consists of two components: a magnitude and a direction. This research has shown, both experimentally and mathematically, that when a series of keywords, all of which are indicative of a unique class, is processed, the magnitude of the Bayesian distance increases monotonically and the direction remains constant. If, however, a noisy keyword occurs, the magnitude decreases or the direction changes. This interesting phenomenon has been utilized to effectively separate the keywords contained in a document into two groups: the good keywords and the noisy keywords. The good keywords, all of which are in general indicative of a unique class, are then analyzed to identify a primary class for the document. But how about the group of noisy keywords?

This research has shown that the noisy keywords, which were identified as being noisy with respect to the primary class, can be utilized to explore the possibility of assigning a secondary class to the document. The noisy keywords are analyzed, using the Bayesian distance measure, to ascertain whether they relate the document to an additional class. If this is so, then this additional class is identified as the secondary class.

Results obtained by applying this automatic sequential algorithm on a portion of the SPIN data base have shown that the technique can be quite successful in identifying primary and secondary classes for a document. An accuracy of about 80% has been obtained for primary classification. For secondary classification the accuracy has been better than 60%. Experimental evidence obtained by using class correlation techniques and manual methods of classification has corroborated this result. Considering the fact that the SPIN data base contains only abstracts, most of which have a very limited number of keywords, these results appear to be quite encouraging. It is expected that if full documents had been used instead of abstracts, then the secondary classifier would be able to examine a substantially greater number of keywords before assigning secondary classes and hence would probably yield better results.

Several related areas of research can be identified at this stage. In the area of automatic document classification, the problem of keyword selection is of paramount importance. Manual and semi-automatic methods have generally been used to select keywords for a given data base. It would be very advantageous if there were an automatic method that could identify bad or inappropriate keywords in a set which is initially chosen manually or semi-automatically. We have shown that the Bayesian distance measure is an effective device to detect the occurrence of noisy keywords in a document during classification. Some of these noisy keywords are eventually used to obtain a secondary class. However, those that do not give any information about either a primary class or a secondary class can be isolated as bad keywords and eliminated from the original list of keywords. Hopefully, this would result in a keyword set which is more representative of the data base.

Another interesting problem would be to study the effect of the order in which the keywords are processed to obtain primary and secondary classification. It has been shown in this research that if all the keywords extracted from a document are indicative of a unique class, then the order in which they are processed does not matter. However, if noisy keywords are present in a document, then in some cases the order of processing the keywords may have detrimental effects on the final decision. This aspect needs further development and better mathematical characterization.

Finally, a third problem, which is much broader in scope, can be identified. It was pointed out in the Introduction that the main purpose of automatic document classification is to aid the process of information retrieval from a data base. How can the Bayesian distance technique be utilized for this purpose? If the user queries could also be processed by means of this technique to obtain an indication of the various subject areas which might be relevant, then the classification and retrieval aspects of an information system could be integrated. Since, however, queries are much shorter in length than documents or abstracts, the Bayesian distance technique would probably have to be modified considerably in order that the queries may be analyzed effectively.



APPENDIX A

BAYESIAN DISTANCE AND CLASSIFICATION ERROR

In Chapter IV it was shown that for a two class problem the magnitude of the Bayesian distance forms a tight upper bound on the classification error. In this appendix it will be shown that as the number of classes increases, the quality of the Bayesian distance as an approximating function to the classification error does not degrade very much.

Theorem

For a t-class problem, after an observation $y$, i.e., after $i$ keywords have been read, the upper bound on classification error given by $1 - \mathrm{Mag}(i)$ does not exceed the value of the classification error by more than $(t-1)/4t$. I.e., if $I = \{1, 2, \ldots, t\}$ then

$$[1 - \mathrm{Mag}(i)] - \left[1 - \max_{r \in I}\{p(C_r/y)\}\right] \le \frac{t-1}{4t}$$

or

$$\max_{r \in I}\{p(C_r/y)\} - \sum_{r \in I} [p(C_r/y)]^2 \le \frac{t-1}{4t}$$

Proof:

Let

$$\max_{r \in I}\{p(C_r/y)\} = p$$

If the inequality is satisfied for the minimum value of $\sum_{r \in I} [p(C_r/y)]^2$ then the theorem is proven. Without loss of generality let us assume

$$p(C_1/y) = p$$

Then the minimum of $\sum_{r \in I} [p(C_r/y)]^2$ is achieved when the remaining a posteriori probabilities are equal, i.e., when

$$p(C_r/y) = \frac{1-p}{t-1}, \qquad r = 2, 3, \ldots, t$$

Therefore

$$Q = \max_{r \in I}\{p(C_r/y)\} - \sum_{r \in I} [p(C_r/y)]^2 \le p - p^2 - \frac{(1-p)^2}{t-1}$$

Q achieves its maximum for some p; therefore taking the derivative of Q with respect to p and setting it equal to zero we have

$$\frac{dQ}{dp} = 1 - 2p + \frac{2}{t-1} - \frac{2p}{t-1} = 0$$

or

$$p = \frac{t+1}{2t}$$

Substituting this in Q we have

$$Q = \frac{t-1}{4t}$$

Since Q is equal to this in the worst case, in general therefore

$$Q \le \frac{t-1}{4t}$$

Essentially, this theorem says that as the number of classes increases, the upper bound approaches a constant value. In terms of document classification, this theorem stipulates that the higher the final value of the Bayesian distance at the time of classification, the more probable it is that the document will be correctly classified.
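The bound in the theorem can be checked numerically. The Python sketch below assumes that the bound reads (t-1)/4t and that the worst case occurs at p = (t+1)/2t with the remaining a posteriori probabilities equal, which is what the derivation above yields. All function names are illustrative and are not part of the original work.

```python
import random


def bound_gap(posteriors):
    """max p(C_r/y) minus the sum of squared posteriors (the quantity Q)."""
    return max(posteriors) - sum(p * p for p in posteriors)


def theorem_bound(t):
    """Assumed form of the bound for a t-class problem: (t - 1) / (4 t)."""
    return (t - 1) / (4 * t)


def worst_case_posteriors(t):
    """Largest posterior (t + 1) / (2 t); the other t - 1 posteriors are equal."""
    p = (t + 1) / (2 * t)
    rest = (1 - p) / (t - 1)
    return [p] + [rest] * (t - 1)


def random_posteriors(t, rng):
    """A random probability vector of length t."""
    raw = [rng.random() + 1e-9 for _ in range(t)]
    total = sum(raw)
    return [x / total for x in raw]


if __name__ == "__main__":
    rng = random.Random(0)
    for t in (2, 3, 5, 10, 100):
        worst = bound_gap(worst_case_posteriors(t))
        sampled = max(bound_gap(random_posteriors(t, rng)) for _ in range(1000))
        print(t, theorem_bound(t), worst, sampled)
```

As t grows, the assumed bound tends to 1/4, which matches the remark that the upper bound approaches a constant value.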
APPENDIX B

BAYESIAN DISTANCE AND α VALUES

In Chapter IV it was claimed that if we have a set of α values

$$[\alpha_1, \alpha_2, \ldots, \alpha_t]$$

such that

$$\alpha_1 > \alpha_2$$

and

$$\alpha_j = \alpha_{j+1}, \qquad j = 2 \text{ to } t-1,$$

then an increase in $\alpha_1$ will increase the magnitude of the Bayesian distance. In this appendix this claim will be validated.

Let $M_1$ be the $D_B$ magnitude obtained from $[\alpha_1, \alpha_2, \ldots, \alpha_t]$. Then

$$M_1 = \sum_{j=1}^{t} (\alpha_j)^2 \qquad (B.1)$$

Since $\alpha_j = \alpha_{j+1}$, $j = 2$ to $t-1$, and the α values sum to one, we can write equation (B.1) as

$$M_1 = \alpha_1^2 + \frac{(1-\alpha_1)^2}{t-1} \qquad (B.2)$$

Now, if $\alpha_1$ is increased by $\delta$ we have the new set of α values

$$[\alpha_1 + \delta, \alpha_2', \ldots, \alpha_t']$$

The value of the new $D_B$ magnitude, $M_2$, calculated by using this set of α values will be a minimum if $\alpha_2'$ through $\alpha_t'$ are all equal. Then the minimum value of $M_2$ is given by

$$M_{2\,\min} = (\alpha_1 + \delta)^2 + \frac{(1 - \alpha_1 - \delta)^2}{t-1}$$

Therefore

$$M_{2\,\min} - M_1 = 2\delta\left[\alpha_1 - \frac{1-\alpha_1}{t-1}\right] + \frac{t\,\delta^2}{t-1}$$

Since

$$\alpha_1 > \alpha_2 = \frac{1-\alpha_1}{t-1}$$

it is true that

$$\alpha_1 - \frac{1-\alpha_1}{t-1} > 0$$

Therefore

$$M_{2\,\min} > M_1$$

Since the minimum value of $M_2$ has been shown to be greater than $M_1$, the claim has been validated.
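The claim of this appendix can be illustrated with a small numerical sketch. It assumes that the magnitude is the sum of the squared α values (equation (B.1)), that the α values sum to one, and that all α values other than the first are equal. The function names are illustrative only.

```python
def magnitude(alphas):
    """Magnitude as the sum of squared alpha values, as in equation (B.1)."""
    return sum(a * a for a in alphas)


def with_equal_rest(alpha_1, t):
    """Alpha vector whose first entry is alpha_1 and whose other t - 1 entries are equal."""
    rest = (1.0 - alpha_1) / (t - 1)
    return [alpha_1] + [rest] * (t - 1)


def min_increase(alpha_1, delta, t):
    """Smallest possible change in magnitude when alpha_1 grows by delta."""
    before = magnitude(with_equal_rest(alpha_1, t))
    after = magnitude(with_equal_rest(alpha_1 + delta, t))
    return after - before


if __name__ == "__main__":
    # Example with t = 4 classes, alpha_1 = 0.4 and an increase of 0.1
    print(min_increase(0.4, 0.1, 4))
```

Whenever the first α value already exceeds the others, `min_increase` is positive, and spreading the remaining probability unevenly can only make the new magnitude larger.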
APPENDIX C

BAYESIAN DISTANCE AND KEYWORD VECTORS

In this appendix, Theorem 5.2, stated and discussed in Chapter V, will be proven.

Theorem 5.2: Given two keywords $k_1$ and $k_2$, with probability values $[p_1, p_2, p_3]$ and $[q_1, q_2, q_3]$ respectively, let $M_1$ be the magnitude of the Bayesian distance calculated using $k_1$ and $M_2$ be the magnitude calculated using $k_1$ and $k_2$. Then $M_2 \ge M_1$, i.e.,

$$\frac{\sum_{r=1}^{3} (p_r q_r)^2}{\left(\sum_{r=1}^{3} p_r q_r\right)^2} \ge \frac{\sum_{r=1}^{3} p_r^2}{\left(\sum_{r=1}^{3} p_r\right)^2} \qquad (C.1)$$

Proof: Expanding the left hand side of the inequality (C.1) we have

$$F = \frac{p_1^2 q_1^2 + p_2^2 q_2^2 + p_3^2 q_3^2}{(p_1 q_1 + p_2 q_2 + p_3 q_3)^2} \qquad (C.2)$$

Assume that the values of $p_1$, $p_2$ and $p_3$ are fixed. If it can be shown that the minimum value of F obtained by varying $q_1$, $q_2$ and $q_3$ under the given constraints is greater than or equal to the right hand side of (C.1), then the inequality is proven. Taking partial derivatives of F with respect to $q_1$, $q_2$ and $q_3$, simplifying and setting to zero we obtain

$$p_1 q_1 p_2 q_2 - p_2^2 q_2^2 + p_1 q_1 p_3 q_3 - p_3^2 q_3^2 = 0 \qquad (C.3)$$

$$p_1 q_1 p_2 q_2 - p_1^2 q_1^2 + p_2 q_2 p_3 q_3 - p_3^2 q_3^2 = 0 \qquad (C.4)$$

$$p_1 q_1 p_3 q_3 - p_1^2 q_1^2 + p_2 q_2 p_3 q_3 - p_2^2 q_2^2 = 0 \qquad (C.5)$$

From equations (C.3), (C.4) and (C.5) we see that a minimum occurs when $p_1 q_1 = p_2 q_2 = p_3 q_3$. This contradicts the assumptions that

$$p_1 q_1 > p_2 q_2 > p_3 q_3$$

Therefore we have to check if there are any other minimum points of F.

Solving equation (C.4) for $p_2 q_2$ we have

$$p_2 q_2 = \frac{p_1^2 q_1^2 + p_3^2 q_3^2}{p_1 q_1 + p_3 q_3} \qquad (C.6)$$

Substituting that in equation (C.5) we have

... ' ^pM^^pM"*'^5i%P3%*p|%) , ' . 

; + ^4%^ -%'^^''^P3^3^^Pl^ *P3^3^ ' V ' • • ■ 

"/ U It a 2 2 2 i 1* , 
- ^Pl%'-2pJ'llP3«3"^P3'l35 " • ; ' ^ (CT) 
Let $x = p_1 q_1$, $y = p_3 q_3$; equation (C.7) reduces to

$$x^4 - 2x^3 y + y^4 = 0 \qquad (C.8)$$

$y = x$ is a solution to equation (C.8), but this implies $p_1 q_1 = p_3 q_3$. Since this is not possible let us identify the other roots of equation (C.8). Dividing it by $(y - x)$ we obtain

$$y^3 + x y^2 + x^2 y - x^3 = 0 \qquad (C.9)$$

From Descartes' rule of signs it is seen that equation (C.9) has no more than one positive real root and no more than two negative real roots.

Substituting $y = z - \frac{x}{3}$ in equation (C.9) we obtain a reduced cubic in z:

$$\left(z - \frac{x}{3}\right)^3 + x\left(z - \frac{x}{3}\right)^2 + x^2\left(z - \frac{x}{3}\right) - x^3 = 0 \qquad (C.10)$$

Equation (C.10) simplifies to

$$z^3 + \frac{2}{3} x^2 z - \frac{34}{27} x^3 = 0 \qquad (C.11)$$

This is of the form

$$z^3 + a z + b = 0$$

and can be solved by Cardan's method.

Substitute $z = u + v$; z is a root if

$$u^3 + v^3 = -b \qquad (C.12)$$

and

$$uv = -\frac{a}{3} \qquad (C.13)$$
From equation (C.13) we have

$$v = -\frac{a}{3u}$$

Substituting this in equation (C.12) we have

$$u^3 - \frac{a^3}{27 u^3} = -b$$

If u is a cube root of $\frac{1}{2}\left\{-b + \left(b^2 + \frac{4a^3}{27}\right)^{1/2}\right\}$ then the three roots of equation (C.11) are

$$u + v, \qquad wu + w^2 v, \qquad w^2 u + w v$$

where $3uv = -a$ and $w = \frac{1}{2}(-1 + i\sqrt{3})$.

Let

$$R = \frac{b^2}{4} + \frac{a^3}{27}$$

If these roots are real and distinct then R is less than zero. For our problem

$$a = \frac{2}{3} x^2, \qquad b = -\frac{34}{27} x^3$$

Therefore

$$R = \frac{11}{27} x^6 > 0$$

From this we have

$$u^3 = \left[\frac{17}{27} + \left(\frac{11}{27}\right)^{1/2}\right] x^3$$

Therefore the positive real root is given by

$$z = \left[\frac{17}{27} + \left(\frac{11}{27}\right)^{1/2}\right]^{1/3} x + \left[\frac{17}{27} - \left(\frac{11}{27}\right)^{1/2}\right]^{1/3} x$$

Therefore

$$z = 0.87702\,x$$

Since $y = z - \frac{x}{3}$ we have

$$y = 0.54369\,x$$

i.e.,

$$p_3 q_3 = (0.54369)\,p_1 q_1 \qquad (C.15)$$

From equation (C.6) we obtain

$$p_2 q_2 = (0.83929)\,p_1 q_1 \qquad (C.16)$$

The minimum value of F is therefore given by

$$F_{\min} = \frac{p_1^2 q_1^2\,[1 + (0.54369)^2 + (0.83929)^2]}{p_1^2 q_1^2\,[1 + 0.54369 + 0.83929]^2}$$

which reduces to

$$F_{\min} = 0.35220$$

This is independent of $(p_1 q_1)$.
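The numerical constants quoted above can be reproduced with a few lines of Python. The sketch assumes that equation (C.9), with $p_1 q_1$ scaled to 1, reads y³ + y² + y - 1 = 0, and that equation (C.6) gives the second ratio as (1 + y²)/(1 + y). The bisection helper and all function names are illustrative and are not taken from the original work.

```python
def bisect_root(f, lo, hi, tol=1e-12):
    """Root of f in [lo, hi] by bisection; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2.0


def quoted_ratios():
    """Ratios p2*q2 / p1*q1 and p3*q3 / p1*q1 under the assumed (C.9) and (C.6)."""
    y = bisect_root(lambda v: v ** 3 + v ** 2 + v - 1.0, 0.0, 1.0)
    x2 = (1.0 + y * y) / (1.0 + y)
    return x2, y


def f_value(x1, x2, x3):
    """F of equation (C.2) written in terms of the products x_r = p_r * q_r."""
    return (x1 * x1 + x2 * x2 + x3 * x3) / (x1 + x2 + x3) ** 2


if __name__ == "__main__":
    x2, y = quoted_ratios()
    print(round(x2, 5), round(y, 5), round(f_value(1.0, x2, y), 5))
```

Scaling all three products by the same factor leaves `f_value` unchanged, which is the sense in which the quoted value does not depend on $p_1 q_1$.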



From equations (C.15) and (C.16) we obtain

$$q_2 = (0.83929)\,\frac{p_1}{p_2}\,q_1 \qquad (C.17)$$

$$q_3 = (0.54369)\,\frac{p_1}{p_3}\,q_1 \qquad (C.18)$$

The choice of $p_1$, $p_2$ and $p_3$ can be arbitrary, but the constraints require that

$$q_2 < q_1$$

and

$$q_3 < q_1$$

which imply (from equations (C.17) and (C.18))

$$p_2 > 0.83929\,p_1 \qquad (C.19)$$

$$p_3 > 0.54369\,p_1 \qquad (C.20)$$

At this point it is appropriate to recapitulate the original problem.

$$M_1 = \frac{p_1^2 + p_2^2 + p_3^2}{(p_1 + p_2 + p_3)^2} \qquad (C.21)$$

$$M_2 = \frac{p_1^2 q_1^2 + p_2^2 q_2^2 + p_3^2 q_3^2}{(p_1 q_1 + p_2 q_2 + p_3 q_3)^2} \qquad (C.22)$$

We form the function G given by

$$G = [M_1 - M_2] \qquad (C.23)$$

G will be maximized under several inequality constraints discussed later in the section. In order to obtain the extreme points of G these will be replaced by equalities in each case. If it can be shown that the maximum value of G is less than or equal to zero then the inequality (C.1) will have been proven. $M_1$ is independent of $q_1$, $q_2$ and $q_3$, so $M_2$ has to be minimized with respect to $q_1$, $q_2$ and $q_3$. It has been shown that this is achieved when equations (C.17) and (C.18) are satisfied. But conditions $q_2 < q_1$ and $q_3 < q_1$ require extra constraints. These constraints are given by inequalities (C.19) and (C.20). Therefore under these constraints the maximum value of $M_1$ is obtained when

$$p_2 = 0.83929\,p_1 \qquad (C.24)$$

$$p_3 = 0.54369\,p_1 \qquad (C.25)$$

Thus the maximum value of G when the constraints given by inequalities (C.19) and (C.20) are satisfied is

$$G = [M_1 - M_2] = [0.35220 - 0.35220] = 0$$

Therefore inequality (C.1) holds, i.e., the theorem is proven under these constraints.

Now, it has to be shown that when constraints (C.19) and (C.20) do not hold, inequality (C.1) is still valid. Three cases have to be considered.

(I) Constraint $p_2 > 0.83929\,p_1$ is not satisfied, but $p_3 > 0.54369\,p_1$ is satisfied.

(II) Constraint $p_3 > 0.54369\,p_1$ is not satisfied, but $p_2 > 0.83929\,p_1$ is satisfied.

(III) Neither of the two constraints is satisfied.

Case (I)

In this case, in order to minimize $M_2$, $q_2$ can be made as large as possible keeping in mind that

$$q_2 \le q_1$$

has to be satisfied. Therefore choose $q_2 = q_1$, so that only $q_3$ can vary. It has been shown (see equation (C.5)) that $\partial F / \partial q_3 = 0$ implies

$$p_3 q_3 = \frac{p_1^2 q_1^2 + p_2^2 q_2^2}{p_1 q_1 + p_2 q_2} \qquad (C.26)$$

Since $q_2 = q_1$, from equation (C.26) we have

$$q_3 = \frac{p_1^2 + p_2^2}{p_3\,(p_1 + p_2)}\,q_1$$

If the constraint $q_3 < q_1$ has to be satisfied then

$$\frac{p_1^2 + p_2^2}{p_3\,(p_1 + p_2)} < 1 \qquad (C.27)$$

This means

$$p_1^2 + p_2^2 < p_3\,(p_1 + p_2) \qquad (C.28)$$

Since $p_1 > p_2 > p_3$, inequality (C.28) is never satisfied. Therefore we must choose $q_3 = q_1$ in order to minimize $M_2$. Hence

$$M_2 = M_1$$

and

$$G = 0$$

Therefore the theorem is true in this case.



Case (II)

Following the same arguments given in Case (I), we choose $q_3 = q_1$. Then F is minimized with respect to $q_2$ under the constraint $q_2 < q_1$, which requires

$$\frac{p_1^2 + p_3^2}{(p_1 + p_3)\,p_2} < 1 \qquad (C.29)$$

Since $p_2 > p_3$, at this point (C.29) may or may not be satisfied; thus two subcases may be identified.

Case II (a)

If

$$\frac{p_1^2 + p_3^2}{(p_1 + p_3)\,p_2} < 1$$

then to minimize F choose

$$q_3 = q_1$$

and

$$q_2 = \frac{p_1^2 + p_3^2}{(p_1 + p_3)\,p_2}\,q_1$$
Then $M_2$ is given by



2 . a \ 



We recall that



M- » 



Since



2x 2 
^1 



Therefore the theorem is also valid in this case.

Case II (b)

If

$$\frac{p_1^2 + p_3^2}{(p_1 + p_3)\,p_2} \ge 1$$

then we have to choose $q_2 = q_1$ so that F may be minimized. Then we have

$$M_2 = M_1$$

and

$$G = 0$$

Therefore the theorem holds in this case.
Case (III)

Since neither constraint (C.19) nor (C.20) is satisfied, $q_2$ and $q_3$ can be made as large as possible under the constraints

$$q_2 \le q_1, \qquad q_3 \le q_1$$

Therefore for minimum F we have

$$q_1 = q_2 = q_3$$

Hence

$$M_2 = M_1$$

and

$$G = 0$$

Therefore the theorem has been shown to hold for all cases.
APPENDIX D

The classification algorithm consists of the keyword extractor module, the primary classifier, the Bayesian distance calculator and the noise detector. In this appendix we will outline each of these modules so that this algorithm may be implemented. The following entities need to be discussed for better understanding of the algorithm.

Q and β are parameters discussed in Chapter V.

S_G and S_N are sets containing the indices of the words in the GBUF and the NBUF respectively.

S_I is a set of indices of keywords for which Bayesian distances have to be calculated during a call to the Bayesian distance calculator module. It is equal to either S_G or S_N.

Mag[1:n] and Dir[1:n] are arrays of magnitudes and directions calculated by the Bayesian distance calculator.

Keyword Extractor

This module reads one keyword from the document and places the following in an input buffer:

a) a value corresponding to the index of the keyword read;

b) the keyword; and

c) the probability values associated with the keyword. These are stored in the matrix P(i,j), where i is the index of the keyword and j = 1 to t correspond to the t categories.

Primary Classifier

This module analyzes the keywords in the GBUF to obtain a primary class. The input consists of values for Q and β and the output is a numeric code corresponding to the SPIN category which is assigned as the primary class.

3. call Keyword Extractor; i = i + 1; if i < Q then go to step 2;

5. load indices 1 through Q in GBUF;

6. S_I = S_G; D = |S_I|; call Bayesian Distance Calculator;

7. for j = 1 to D-1, check to see whether Mag(j) < Mag(j+1) and Dir(j) = Dir(j+1). If not then call Noise Detector; else go to step 15;

8. j = index of noisy word; put j in the NBUF; S_G = S_G - {j}; S_N = S_N + {j}; A = last element in S_G; B = last element in S_N;

[...] if |S_G| [...] then S_I = S_G; call Bayesian Distance Calculator; else go to step 6;

11. S_I = S_N; call Bayesian Distance Calculator;

12. Y = Mag(B); if Y > X then interchange the elements of the GBUF and NBUF and the elements of S_G and S_N respectively;

13. call Keyword Extractor; if document has no more keywords then identify it as unclassifiable;

14. load index of new keyword in the GBUF; go to step 6;

15. if Mag(D) > β then primary class = Dir(D); else go to step 13.



Bayesian Distance Calculator

Given a set of indices of keywords, this module calculates the magnitudes and directions of D_B and stores them in the arrays Mag[1:n] and Dir[1:n] respectively. The input to this module is a matrix P[n,t] containing the probability values associated with the keywords whose indices are in S_I.

1. N = |S_I|;

2. i = 1;

3. if i = 1 then for j = 1 to t, set PROB(i,j) = P(i,j); else for j = 1 to t, set PROB(i,j) = PROB(i-1,j) * P(i,j);

4. PSUM = Σ_j PROB(i,j);

5. for j = 1 to t compute AL(j) = PROB(i,j) / PSUM;

6. Mag(i) = Σ_j AL²(j);

7. Dir(i) = index of the largest value of AL(j);

8. i = i + 1; if i > N then return; else go to step 3.


Noise Detector

This module compares the magnitudes and directions of the keywords in the GBUF and detects a noisy word. It returns the index of this word.

2. for i = 1 to N-1 check whether Mag(i) > Mag(i + 1). If so then return (i + 1) as the index of the noisy keyword.

3. for i = 1 to N-1, check whether Dir(i) = Dir(i + 1). If not then return (i + 1) as the index of the noisy keyword.
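The two checks of the Noise Detector can be sketched in Python as shown below. The function name and the convention of returning `None` when no noisy keyword is found are assumptions made for illustration; positions are 1-based to match the outline above, and only the two steps that are legible in this outline are reproduced.

```python
from typing import Optional, Sequence


def detect_noisy_keyword(mags: Sequence[float], dirs: Sequence[int]) -> Optional[int]:
    """Return the 1-based position of the first noisy keyword, or None.

    mags[i] and dirs[i] are the magnitude and direction obtained after the
    (i+1)-th keyword has been processed.
    """
    n = len(mags)
    if len(dirs) != n:
        raise ValueError("mags and dirs must have the same length")
    # Step 2: a drop in magnitude marks the later keyword as noisy
    for i in range(1, n):
        if mags[i - 1] > mags[i]:
            return i + 1
    # Step 3: a change of direction marks the later keyword as noisy
    for i in range(1, n):
        if dirs[i - 1] != dirs[i]:
            return i + 1
    return None


if __name__ == "__main__":
    print(detect_noisy_keyword([0.46, 0.75, 0.49], [1, 1, 2]))
```

Because the magnitude check runs over all positions before the direction check starts, a later drop in magnitude is reported ahead of an earlier change of direction; this follows the order of the steps as listed and may differ from the original program.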
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